feat: visualization of conversation affinity in dashboard
This commit is contained in:
parent
4acbaeb29c
commit
aa7ec6354a
5 changed files with 306 additions and 19 deletions
26
config.yaml
26
config.yaml
|
|
@ -27,14 +27,24 @@ max_concurrent_connections: 2
|
|||
# priority_routing: true
|
||||
|
||||
# Conversation affinity (optional, default: false).
|
||||
# Routes follow-up requests back to the endpoint that previously served the
|
||||
# same conversation so the llama.cpp / Ollama prompt cache (KV cache) stays
|
||||
# warm — first turn does a cold prefill, follow-ups skip it. Soft preference:
|
||||
# falls back to the standard algorithm when the affine endpoint no longer has
|
||||
# the model loaded or has no free slot. Conversations are fingerprinted by
|
||||
# (model, first system + first user turn).
|
||||
# conversation_affinity: true
|
||||
# conversation_affinity_ttl: 300 # seconds; matches Ollama's default keep_alive
|
||||
# Pins a conversation to the endpoint that served its first turn so the
|
||||
# llama.cpp / Ollama prompt cache (KV cache) stays warm — first turn pays
|
||||
# the cold prefill, every follow-up turn reuses the same prefix.
|
||||
#
|
||||
# Fingerprint = sha1(model + leading system messages + first user turn).
|
||||
# Same chat → same fingerprint on every follow-up turn → same pin, TTL
|
||||
# refreshed on each reuse. Soft preference: if the pinned endpoint no
|
||||
# longer has the model loaded or has no free slot, the standard algorithm
|
||||
# takes over (no failure, just a cache miss).
|
||||
#
|
||||
# Heads-up: most chat UIs (Open WebUI, LibreChat, …) fire side requests for
|
||||
# title / tag / follow-up generation. Those have their own first turn and
|
||||
# therefore their own pin, so a single visible "chat" may show several dots
|
||||
# in the dashboard's Affinity column. That is correct — each pin matches a
|
||||
# real warm KV prefix on the backend. See doc/configuration.md for details.
|
||||
conversation_affinity: true
|
||||
conversation_affinity_ttl: 300 # seconds of inactivity before a pin expires;
|
||||
# bumped on every reuse. Matches Ollama's default keep_alive.
|
||||
|
||||
# Optional router-level API key that gates router/API/web UI access (leave empty to disable)
|
||||
nomyo-router-api-key: ""
|
||||
|
|
|
|||
|
|
@ -166,6 +166,91 @@ With this config the primary handles up to 4 concurrent requests before the seco
|
|||
|
||||
---
|
||||
|
||||
### `conversation_affinity`
|
||||
|
||||
**Type**: `bool` (optional)
|
||||
|
||||
**Default**: `false`
|
||||
|
||||
**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)
|
||||
|
||||
**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
|
||||
|
||||
#### How a conversation is identified
|
||||
|
||||
The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
|
||||
|
||||
```
|
||||
SHA1( model
|
||||
+ every leading message with role="system"
|
||||
+ the first message with role="user" )
|
||||
```
|
||||
|
||||
Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
|
||||
|
||||
**What this means in practice**
|
||||
|
||||
| You send… | Fingerprint behaves like… |
|
||||
|---|---|
|
||||
| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
|
||||
| Turn 1 of a fresh chat | **New** fingerprint → new pin |
|
||||
| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
|
||||
| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |
|
||||
|
||||
#### TTL and refresh
|
||||
|
||||
Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
|
||||
|
||||
#### Why the dashboard may show more than one dot per visible conversation
|
||||
|
||||
The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
|
||||
|
||||
- *Title generation* — synthetic system prompt + the user message as content
|
||||
- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
|
||||
- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
|
||||
|
||||
Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
|
||||
|
||||
#### Example
|
||||
|
||||
```yaml
|
||||
endpoints:
|
||||
- http://gpu-primary:11434
|
||||
- http://gpu-secondary:11434
|
||||
|
||||
conversation_affinity: true
|
||||
conversation_affinity_ttl: 300
|
||||
```
|
||||
|
||||
With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
|
||||
|
||||
#### When to enable
|
||||
|
||||
- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
|
||||
- ✅ Multi-endpoint deployments where models are loaded on more than one node.
|
||||
- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
|
||||
- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
|
||||
|
||||
---
|
||||
|
||||
### `conversation_affinity_ttl`
|
||||
|
||||
**Type**: `int` (seconds, optional)
|
||||
|
||||
**Default**: `300`
|
||||
|
||||
**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
|
||||
|
||||
**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
conversation_affinity: true
|
||||
conversation_affinity_ttl: 600 # half an hour of inactivity before un-pinning
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### `router_api_key`
|
||||
|
||||
**Type**: `str` (optional)
|
||||
|
|
|
|||
|
|
@ -166,6 +166,39 @@ curl -X POST http://localhost:12434/api/cache/invalidate
|
|||
|
||||
Clears all cached entries and resets hit/miss counters.
|
||||
|
||||
### Affinity Stats (Conversation Affinity)
|
||||
|
||||
```bash
|
||||
curl http://localhost:12434/api/affinity_stats
|
||||
```
|
||||
|
||||
Response when [`conversation_affinity`](configuration.md#conversation_affinity) is enabled:
|
||||
|
||||
```json
|
||||
{
|
||||
"enabled": true,
|
||||
"ttl": 300,
|
||||
"entries": [
|
||||
{ "endpoint": "http://gpu-primary:11434", "model": "llama3.2:latest", "remaining": 287.4 },
|
||||
{ "endpoint": "http://gpu-primary:11434", "model": "llama3.2:latest", "remaining": 113.0 },
|
||||
{ "endpoint": "http://gpu-secondary:11434", "model": "qwen2.5-coder:7b", "remaining": 44.8 }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Response when the feature is disabled:
|
||||
```json
|
||||
{ "enabled": false, "ttl": 300, "entries": [] }
|
||||
```
|
||||
|
||||
- One element per **live pinned conversation** (no fingerprints or content — just the endpoint/model the pin points to and how many seconds it has left before expiry).
|
||||
- Aggregation by `(endpoint, model)` is left to the consumer: the dashboard does this client-side.
|
||||
- The endpoint is gated by the same `nomyo-router-api-key` middleware as the rest of `/api/*`.
|
||||
|
||||
The dashboard's **Running Models (PS) → Affinity** column is rendered from this data. The column auto-hides when `enabled: false`. Each row shows one dot per live pin against that `(endpoint, model)` pair; dot opacity = `remaining / ttl` (floor 0.15), so freshly-routed pins are solid and pins close to expiry fade out. A `+N` overflow badge appears once a single (endpoint, model) holds more than 12 active pins; an em-dash (`—`) marks an `(endpoint, model)` with no live pins.
|
||||
|
||||
> Multiple dots for what looks like "one chat window" is normal — most chat UIs (Open WebUI, LibreChat, …) fire auxiliary requests (title generation, follow-up suggestions, tag extraction) that have their own first-turn fingerprint and therefore their own pin. See [Conversation Affinity → Why the dashboard may show more than one dot per visible conversation](configuration.md#conversation_affinity) for the details.
|
||||
|
||||
### Real-time Usage Stream
|
||||
|
||||
```bash
|
||||
|
|
|
|||
47
router.py
47
router.py
|
|
@ -445,10 +445,12 @@ token_usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(
|
|||
usage_lock = asyncio.Lock() # protects access to usage_counts
|
||||
token_usage_lock = asyncio.Lock()
|
||||
|
||||
# Conversation affinity map: fingerprint -> (endpoint, expires_at_monotonic).
|
||||
# Conversation affinity map: fingerprint -> (endpoint, model, expires_at_monotonic).
|
||||
# Keeps the same conversation pinned to the endpoint that already has its
|
||||
# KV-cache prefix warm. Never held together with usage_lock.
|
||||
_affinity_map: Dict[str, tuple[str, float]] = {}
|
||||
# KV-cache prefix warm. Model is stored so the dashboard can aggregate live
|
||||
# entries per (endpoint, model) without recomputing fingerprints.
|
||||
# Never held together with usage_lock.
|
||||
_affinity_map: Dict[str, tuple[str, str, float]] = {}
|
||||
_affinity_lock = asyncio.Lock()
|
||||
_AFFINITY_MAX_ENTRIES = 10000
|
||||
|
||||
|
|
@ -1859,7 +1861,7 @@ async def choose_endpoint(model: str, reserve: bool = True,
|
|||
async with _affinity_lock:
|
||||
entry = _affinity_map.get(affinity_key)
|
||||
if entry is not None:
|
||||
ep, expires_at = entry
|
||||
ep, _stored_model, expires_at = entry
|
||||
if expires_at < time.monotonic():
|
||||
_affinity_map.pop(affinity_key, None)
|
||||
else:
|
||||
|
|
@ -1961,10 +1963,10 @@ async def choose_endpoint(model: str, reserve: bool = True,
|
|||
if reserve and config.conversation_affinity and affinity_key:
|
||||
expires_at = time.monotonic() + config.conversation_affinity_ttl
|
||||
async with _affinity_lock:
|
||||
_affinity_map[affinity_key] = (selected, expires_at)
|
||||
_affinity_map[affinity_key] = (selected, model, expires_at)
|
||||
if len(_affinity_map) > _AFFINITY_MAX_ENTRIES:
|
||||
now = time.monotonic()
|
||||
for k in [k for k, v in _affinity_map.items() if v[1] < now]:
|
||||
for k in [k for k, v in _affinity_map.items() if v[2] < now]:
|
||||
_affinity_map.pop(k, None)
|
||||
return selected, tracking_model
|
||||
|
||||
|
|
@ -3103,6 +3105,39 @@ async def ps_details_proxy(request: Request):
|
|||
|
||||
return JSONResponse(content={"models": models}, status_code=200)
|
||||
|
||||
# -------------------------------------------------------------
|
||||
# 18b. Conversation-affinity stats – feeds the PS-table dot matrix
|
||||
# -------------------------------------------------------------
|
||||
@app.get("/api/affinity_stats")
|
||||
async def affinity_stats(request: Request):
|
||||
"""
|
||||
Aggregate live conversation-affinity pins, one entry per pinned conversation.
|
||||
Each entry exposes only the endpoint, model, and remaining TTL in seconds —
|
||||
no fingerprints or content. When conversation_affinity is disabled the
|
||||
`entries` list is always empty.
|
||||
"""
|
||||
if not config.conversation_affinity:
|
||||
return {"enabled": False, "ttl": config.conversation_affinity_ttl, "entries": []}
|
||||
|
||||
now = time.monotonic()
|
||||
entries: list[dict] = []
|
||||
async with _affinity_lock:
|
||||
for fp, (ep, mdl, expires_at) in list(_affinity_map.items()):
|
||||
remaining = expires_at - now
|
||||
if remaining <= 0:
|
||||
_affinity_map.pop(fp, None)
|
||||
continue
|
||||
entries.append({
|
||||
"endpoint": ep,
|
||||
"model": mdl,
|
||||
"remaining": round(remaining, 2),
|
||||
})
|
||||
return {
|
||||
"enabled": True,
|
||||
"ttl": config.conversation_affinity_ttl,
|
||||
"entries": entries,
|
||||
}
|
||||
|
||||
# -------------------------------------------------------------
|
||||
# 19. Proxy usage route – for monitoring
|
||||
# -------------------------------------------------------------
|
||||
|
|
|
|||
|
|
@ -121,6 +121,45 @@
|
|||
.ps-subrow + .ps-subrow {
|
||||
margin-top: 2px;
|
||||
}
|
||||
#ps-table .affinity-col,
|
||||
#ps-table .affinity-cell {
|
||||
display: none;
|
||||
}
|
||||
#ps-table.affinity-on .affinity-col,
|
||||
#ps-table.affinity-on .affinity-cell {
|
||||
display: table-cell;
|
||||
width: 90px;
|
||||
text-align: center;
|
||||
padding-left: 6px;
|
||||
padding-right: 6px;
|
||||
}
|
||||
#ps-table.affinity-on .affinity-dots {
|
||||
max-width: 78px;
|
||||
}
|
||||
.affinity-dots {
|
||||
display: inline-flex;
|
||||
flex-wrap: wrap;
|
||||
gap: 3px;
|
||||
align-items: center;
|
||||
line-height: 1;
|
||||
}
|
||||
.affinity-dot {
|
||||
width: 8px;
|
||||
height: 8px;
|
||||
border-radius: 50%;
|
||||
background: #2e7d32;
|
||||
display: inline-block;
|
||||
transition: opacity 1s linear;
|
||||
}
|
||||
.affinity-overflow {
|
||||
font-size: 10px;
|
||||
color: #555;
|
||||
margin-left: 2px;
|
||||
}
|
||||
.affinity-empty {
|
||||
color: #bbb;
|
||||
font-size: 11px;
|
||||
}
|
||||
#ps-table {
|
||||
width: max-content;
|
||||
min-width: 100%;
|
||||
|
|
@ -131,13 +170,13 @@
|
|||
max-width: 300px;
|
||||
white-space: nowrap;
|
||||
}
|
||||
/* Optimize narrow columns */
|
||||
#ps-table th:nth-child(3),
|
||||
#ps-table td:nth-child(3),
|
||||
/* Optimize narrow columns (Params / Quant / Ctx) */
|
||||
#ps-table th:nth-child(4),
|
||||
#ps-table td:nth-child(4),
|
||||
#ps-table th:nth-child(5),
|
||||
#ps-table td:nth-child(5) {
|
||||
#ps-table td:nth-child(5),
|
||||
#ps-table th:nth-child(6),
|
||||
#ps-table td:nth-child(6) {
|
||||
width: 80px;
|
||||
text-align: center;
|
||||
}
|
||||
|
|
@ -395,6 +434,7 @@
|
|||
<tr>
|
||||
<th class="model-col">Model</th>
|
||||
<th>Endpoint</th>
|
||||
<th class="affinity-col" title="Live conversation-affinity pins (KV-cache warm). One dot per pinned conversation; opacity fades toward TTL expiry.">Affinity</th>
|
||||
<th>Params</th>
|
||||
<th>Quant</th>
|
||||
<th>Ctx</th>
|
||||
|
|
@ -406,7 +446,7 @@
|
|||
</thead>
|
||||
<tbody id="ps-body">
|
||||
<tr>
|
||||
<td colspan="6" class="loading">Loading…</td>
|
||||
<td colspan="10" class="loading">Loading…</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
|
@ -932,6 +972,14 @@ function renderTimeSeriesChart(timeSeriesData, chart, minutes) {
|
|||
return items.map((item) => `<div class="ps-subrow">${item || ""}</div>`).join("");
|
||||
};
|
||||
|
||||
const escapeAttr = (s) => String(s).replace(/&/g, "&").replace(/"/g, """).replace(/</g, "<").replace(/>/g, ">");
|
||||
const renderAffinitySlots = (endpoints, modelName) => {
|
||||
if (!endpoints.length) return "";
|
||||
return endpoints
|
||||
.map((ep) => `<div class="ps-subrow"><span class="affinity-dots" data-endpoint="${escapeAttr(ep)}" data-model="${escapeAttr(modelName)}"></span></div>`)
|
||||
.join("");
|
||||
};
|
||||
|
||||
body.innerHTML = Array.from(grouped.entries())
|
||||
.map(([modelName, modelInstances]) => {
|
||||
const existingRow = psRows.get(modelName);
|
||||
|
|
@ -955,6 +1003,7 @@ function renderTimeSeriesChart(timeSeriesData, chart, minutes) {
|
|||
return `<tr data-model="${modelName}" data-endpoints="${endpointsData}">
|
||||
<td class="model"><span style="color:${getColor(modelName)}">${modelName}</span> <a href="#" class="stats-link" data-model="${modelName}">stats</a></td>
|
||||
<td>${renderInstanceList(endpoints)}</td>
|
||||
<td class="affinity-cell">${renderAffinitySlots(endpoints, modelName)}</td>
|
||||
<td>${params}</td>
|
||||
<td>${quant}</td>
|
||||
<td>${ctx}</td>
|
||||
|
|
@ -972,11 +1021,83 @@ function renderTimeSeriesChart(timeSeriesData, chart, minutes) {
|
|||
const model = row.dataset.model;
|
||||
if (model) psRows.set(model, row);
|
||||
});
|
||||
renderAffinityDots();
|
||||
} catch (e) {
|
||||
console.error(e);
|
||||
}
|
||||
}
|
||||
|
||||
/* ---------- Conversation-affinity dots ---------- */
|
||||
const AFFINITY_MAX_DOTS = 12;
|
||||
let affinityIndex = new Map(); // `${endpoint}|${model}` -> array of {expiresAt}
|
||||
let affinityTtl = 300;
|
||||
let affinityEnabled = false;
|
||||
|
||||
async function loadAffinity() {
|
||||
try {
|
||||
const data = await fetchJSON("/api/affinity_stats");
|
||||
affinityEnabled = !!data.enabled;
|
||||
affinityTtl = Number(data.ttl) || 300;
|
||||
const now = Date.now() / 1000;
|
||||
const idx = new Map();
|
||||
for (const e of data.entries || []) {
|
||||
const key = `${e.endpoint}|${e.model}`;
|
||||
if (!idx.has(key)) idx.set(key, []);
|
||||
idx.get(key).push({ expiresAt: now + Number(e.remaining) });
|
||||
}
|
||||
affinityIndex = idx;
|
||||
applyAffinityColumnVisibility();
|
||||
renderAffinityDots();
|
||||
} catch (err) {
|
||||
// Endpoint may 404 on older deployments — silently degrade.
|
||||
affinityEnabled = false;
|
||||
affinityIndex = new Map();
|
||||
applyAffinityColumnVisibility();
|
||||
renderAffinityDots();
|
||||
}
|
||||
}
|
||||
|
||||
function applyAffinityColumnVisibility() {
|
||||
const table = document.getElementById("ps-table");
|
||||
if (!table) return;
|
||||
table.classList.toggle("affinity-on", affinityEnabled);
|
||||
}
|
||||
|
||||
function renderAffinityDots() {
|
||||
const spans = document.querySelectorAll(".affinity-dots");
|
||||
if (!spans.length) return;
|
||||
const now = Date.now() / 1000;
|
||||
spans.forEach((span) => {
|
||||
const ep = span.dataset.endpoint;
|
||||
const mdl = span.dataset.model;
|
||||
const key = `${ep}|${mdl}`;
|
||||
const pins = (affinityIndex.get(key) || []).filter((p) => p.expiresAt > now);
|
||||
if (pins.length !== (affinityIndex.get(key) || []).length) {
|
||||
if (pins.length) affinityIndex.set(key, pins);
|
||||
else affinityIndex.delete(key);
|
||||
}
|
||||
if (!pins.length) {
|
||||
span.innerHTML = affinityEnabled
|
||||
? `<span class="affinity-empty">—</span>`
|
||||
: "";
|
||||
return;
|
||||
}
|
||||
// Sort freshest first so visible dots are the most "recent".
|
||||
pins.sort((a, b) => b.expiresAt - a.expiresAt);
|
||||
const visible = pins.slice(0, AFFINITY_MAX_DOTS);
|
||||
const overflow = pins.length - visible.length;
|
||||
const dotsHtml = visible
|
||||
.map((p) => {
|
||||
const remaining = Math.max(0, p.expiresAt - now);
|
||||
const opacity = Math.max(0.15, Math.min(1, remaining / affinityTtl));
|
||||
const secs = Math.round(remaining);
|
||||
return `<span class="affinity-dot" style="opacity:${opacity.toFixed(2)}" title="pin expires in ${secs}s"></span>`;
|
||||
})
|
||||
.join("");
|
||||
span.innerHTML = dotsHtml + (overflow > 0 ? `<span class="affinity-overflow">+${overflow}</span>` : "");
|
||||
});
|
||||
}
|
||||
|
||||
/* ---------- Usage Chart (stacked‑percentage) ---------- */
|
||||
function getColor(seed) {
|
||||
const h = Math.abs(hashString(seed) % 360);
|
||||
|
|
@ -1173,10 +1294,13 @@ function renderTimeSeriesChart(timeSeriesData, chart, minutes) {
|
|||
loadEndpoints();
|
||||
loadTags();
|
||||
loadPS();
|
||||
loadAffinity();
|
||||
loadUsage();
|
||||
initHeaderChart();
|
||||
setInterval(tickTpsChart, 1000);
|
||||
setInterval(loadPS, 60_000);
|
||||
setInterval(loadAffinity, 15_000);
|
||||
setInterval(renderAffinityDots, 2_000);
|
||||
setInterval(loadEndpoints, 300_000);
|
||||
|
||||
/* show logic */
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue