Model affinity for consistent model selection in agentic loops (#827)

Commit 8dedf0bec1 (parent 978b1ea722) by Adil Hafeez, 2026-04-08 17:32:02 -07:00, committed via GitHub.
13 changed files with 614 additions and 43 deletions.


@@ -120,6 +120,49 @@ routing_preferences:
---
## Model Affinity
In agentic loops where the same session makes multiple LLM calls, send an `X-Model-Affinity` header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.
```http
POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
```
The routing decision endpoint also supports model affinity:
```http
POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
```
Response when pinned:
```json
{
"models": ["anthropic/claude-sonnet-4-20250514"],
"route": "code generation",
"trace_id": "...",
"session_id": "a1b2c3d4-5678-...",
"pinned": true
}
```
Without the header, routing runs fresh every time (no breaking change).
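As a client-side sketch of that opt-in behavior (the helper name below is ours, not part of Plano's API), the header can be attached conditionally:

```python
def affinity_headers(affinity_id=None):
    # Hypothetical helper: with an ID, the gateway pins routing to the
    # cached model; with None, every request is routed fresh.
    return {"X-Model-Affinity": affinity_id} if affinity_id else {}
```

Pass the result as extra headers on each request in the loop; omit the ID to fall back to fresh routing.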
Configure TTL and cache size:
```yaml
routing:
  session_ttl_seconds: 600     # default: 10 min
  session_max_entries: 10000   # upper limit
```
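The two knobs above behave like a small bounded TTL cache: entries expire after `session_ttl_seconds`, and the oldest session is evicted once `session_max_entries` is reached. A minimal sketch of those semantics (an illustration only, not Plano's actual implementation):

```python
import time
from collections import OrderedDict

class SessionCache:
    """Illustrative sketch of session_ttl_seconds / session_max_entries."""

    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # affinity_id -> (model, expiry)

    def get(self, affinity_id, now=None):
        now = time.monotonic() if now is None else now
        item = self._entries.get(affinity_id)
        if item is None:
            return None
        model, expiry = item
        if now >= expiry:
            # TTL elapsed: drop the pin so the next request routes fresh
            del self._entries[affinity_id]
            return None
        return model

    def pin(self, affinity_id, model, now=None):
        now = time.monotonic() if now is None else now
        if affinity_id not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest session
        self._entries[affinity_id] = (model, now + self.ttl)
```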
---
## Version Requirements
| Version | Top-level `routing_preferences` |


@@ -376,6 +376,44 @@ For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YA
`demo README <https://github.com/katanemo/plano/tree/main/demos/llm_routing/model_routing_service/README.md>`_.
.. _model_affinity:

Model Affinity
--------------
In agentic loops — where a single user request triggers multiple LLM calls through tool use — Plano's router classifies each turn independently. Because successive prompts differ in intent (tool selection looks like code generation, reasoning about results looks like analysis), the router may select different models mid-session. This causes behavioral inconsistency and invalidates provider-side KV caches, increasing both latency and cost.
**Model affinity** pins the routing decision for the duration of a session. Send an ``X-Model-Affinity`` header with any string identifier (typically a UUID). The first request routes normally and caches the result. All subsequent requests with the same affinity ID skip routing and reuse the cached model.
.. code-block:: python

   import uuid
   from openai import OpenAI

   client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")
   affinity_id = str(uuid.uuid4())

   # Every call in the loop uses the same header
   response = client.chat.completions.create(
       model="gpt-4o-mini",
       messages=messages,
       tools=tools,
       extra_headers={"X-Model-Affinity": affinity_id},
   )
Without the header, routing runs fresh on every request — no behavior change for existing clients.
**Configuration:**
.. code-block:: yaml

   routing:
     session_ttl_seconds: 600    # How long affinity lasts (default: 10 min)
     session_max_entries: 10000  # Max cached sessions (upper limit: 10000)
To start a new routing decision (e.g., when the agent's task changes), generate a new affinity ID.
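As a minimal sketch of that rotation (the helper below is hypothetical, not a Plano API):

.. code-block:: python

   import uuid

   def new_affinity_id() -> str:
       # A fresh ID means the next request routes normally again
       # and pins a new decision for the new task.
       return str(uuid.uuid4())

   task_affinity = new_affinity_id()  # reused for every call in this task
   # ... the agent's task changes ...
   task_affinity = new_affinity_id()  # fresh routing for the new task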
Combining Routing Methods
-------------------------


@@ -174,6 +174,11 @@ overrides:
# Model used for agent orchestration (must be listed in model_providers)
agent_orchestration_model: Plano-Orchestrator

# Model affinity — pin routing decisions for agentic loops
routing:
  session_ttl_seconds: 600     # How long a pinned session lasts (default: 600s / 10 min)
  session_max_entries: 10000   # Max cached sessions before eviction (upper limit: 10000)

# State storage for multi-turn conversation history
state_storage:
  type: memory  # "memory" (in-process) or "postgres" (persistent)


@@ -215,6 +215,9 @@ ratelimits:
  selector:
    key: x-org-id
    value: acme-corp
routing:
  session_max_entries: 10000
  session_ttl_seconds: 600
state_storage:
  type: memory
system_prompt: 'You are a helpful assistant. Always respond concisely and accurately.