mirror of https://github.com/katanemo/plano.git, synced 2026-04-26 01:06:25 +02:00
add model affinity docs to llm_router guide, config reference, and routing API
parent da9792c2dd
commit 53602f4788
3 changed files with 86 additions and 0 deletions
@@ -120,6 +120,49 @@ routing_preferences:
---
## Model Affinity
In agentic loops where the same session makes multiple LLM calls, send an `X-Model-Affinity` header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.
```http
POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
```
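On the client side, the pattern is to generate one affinity ID per agent session and attach it to every request in the loop. A minimal sketch, assuming a UUID as the affinity value (any stable per-session string should work) — `build_request` is a hypothetical helper, and the actual POST to the gateway is omitted:

```python
import json
import uuid

# One affinity ID per agent session: the first call routes normally,
# subsequent calls with the same ID reuse the cached model choice.
affinity_id = str(uuid.uuid4())

def build_request(messages):
    """Hypothetical helper: headers and body for a pinned /v1/chat/completions call."""
    headers = {
        "Content-Type": "application/json",
        "X-Model-Affinity": affinity_id,  # same value for every call in the loop
    }
    body = json.dumps({"model": "openai/gpt-4o-mini", "messages": messages})
    return headers, body

headers, body = build_request([{"role": "user", "content": "hello"}])
print("X-Model-Affinity" in headers)  # → True
```

Each iteration of the agent loop reuses `affinity_id` unchanged; generating a fresh ID per call would defeat the pinning.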
The routing decision endpoint also supports model affinity:
```http
POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
```
Response when pinned:
```json
{
  "models": ["anthropic/claude-sonnet-4-20250514"],
  "route": "code generation",
  "trace_id": "...",
  "session_id": "a1b2c3d4-5678-...",
  "pinned": true
}
```
Without the header, routing runs fresh on every request, so existing clients are unaffected (no breaking change).
Configure TTL and cache size:
```yaml
routing:
  session_ttl_seconds: 600     # default: 10 min
  session_max_entries: 10000   # upper limit
```
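The way the two limits interact can be sketched as a toy session cache: entries expire after the TTL, and once the entry cap is reached the oldest entry is dropped. This is an illustration of the semantics described above, not the router's actual implementation — `SessionCache` and its methods are hypothetical:

```python
import time
from collections import OrderedDict

class SessionCache:
    """Toy TTL + size-bounded cache mirroring session_ttl_seconds / session_max_entries."""

    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # affinity_id -> (model, inserted_at)

    def get(self, affinity_id):
        entry = self._entries.get(affinity_id)
        if entry is None:
            return None
        model, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._entries[affinity_id]  # expired: next request routes fresh
            return None
        return model

    def put(self, affinity_id, model):
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict oldest to honor the cap
        self._entries[affinity_id] = (model, time.monotonic())

cache = SessionCache(ttl_seconds=600, max_entries=10000)
cache.put("a1b2", "anthropic/claude-sonnet-4-20250514")
print(cache.get("a1b2"))  # → anthropic/claude-sonnet-4-20250514
```

Once the TTL elapses, a pinned session simply falls back to normal routing on its next request, which is why a short TTL is safe: the worst case is one extra routing pass.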
---
## Version Requirements
| Version | Top-level `routing_preferences` |
| ------- | ------------------------------- |