Model affinity for consistent model selection in agentic loops (#827)

Commit 8dedf0bec1 (parent 978b1ea722) by Adil Hafeez, 2026-04-08 17:32:02 -07:00, committed via GitHub.
13 changed files with 614 additions and 43 deletions.


@@ -120,6 +120,49 @@ routing_preferences:
---
## Model Affinity
In agentic loops where the same session makes multiple LLM calls, send an `X-Model-Affinity` header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.
```http
POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
```
The routing decision endpoint also supports model affinity:
```http
POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
```
Response when pinned:
```json
{
"models": ["anthropic/claude-sonnet-4-20250514"],
"route": "code generation",
"trace_id": "...",
"session_id": "a1b2c3d4-5678-...",
"pinned": true
}
```
Without the header, routing runs fresh every time (no breaking change).
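As a client-side sketch of that opt-in behavior (the helper name below is ours, not part of Plano's API), the header can be attached conditionally:

```python
def affinity_headers(affinity_id=None):
    # Hypothetical helper: with an ID, the gateway pins routing to the
    # cached model; with None, every request is routed fresh.
    return {"X-Model-Affinity": affinity_id} if affinity_id else {}
```

Pass the result as extra headers on each request in the loop; omit the ID to fall back to fresh routing.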
Configure TTL and cache size:
```yaml
routing:
  session_ttl_seconds: 600     # default: 10 min
  session_max_entries: 10000   # upper limit
```
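The two knobs above behave like a small bounded TTL cache: entries expire after `session_ttl_seconds`, and the oldest session is evicted once `session_max_entries` is reached. A minimal sketch of those semantics (an illustration only, not Plano's actual implementation):

```python
import time
from collections import OrderedDict

class SessionCache:
    """Illustrative sketch of session_ttl_seconds / session_max_entries."""

    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # affinity_id -> (model, expiry)

    def get(self, affinity_id, now=None):
        now = time.monotonic() if now is None else now
        item = self._entries.get(affinity_id)
        if item is None:
            return None
        model, expiry = item
        if now >= expiry:
            # TTL elapsed: drop the pin so the next request routes fresh
            del self._entries[affinity_id]
            return None
        return model

    def pin(self, affinity_id, model, now=None):
        now = time.monotonic() if now is None else now
        if affinity_id not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest session
        self._entries[affinity_id] = (model, now + self.ttl)
```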
---
## Version Requirements
| Version | Top-level `routing_preferences` |


@@ -376,6 +376,44 @@ For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YA
`demo README <https://github.com/katanemo/plano/tree/main/demos/llm_routing/model_routing_service/README.md>`_.
.. _model_affinity:

Model Affinity
--------------
In agentic loops — where a single user request triggers multiple LLM calls through tool use — Plano's router classifies each turn independently. Because successive prompts differ in intent (tool selection looks like code generation, reasoning about results looks like analysis), the router may select different models mid-session. This causes behavioral inconsistency and invalidates provider-side KV caches, increasing both latency and cost.
**Model affinity** pins the routing decision for the duration of a session. Send an ``X-Model-Affinity`` header with any string identifier (typically a UUID). The first request routes normally and caches the result. All subsequent requests with the same affinity ID skip routing and reuse the cached model.
.. code-block:: python

   import uuid
   from openai import OpenAI

   client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")
   affinity_id = str(uuid.uuid4())

   # Every call in the loop uses the same header
   response = client.chat.completions.create(
       model="gpt-4o-mini",
       messages=messages,
       tools=tools,
       extra_headers={"X-Model-Affinity": affinity_id},
   )
Without the header, routing runs fresh on every request — no behavior change for existing clients.
**Configuration:**
.. code-block:: yaml

   routing:
     session_ttl_seconds: 600    # How long affinity lasts (default: 10 min)
     session_max_entries: 10000  # Max cached sessions (upper limit: 10000)
To start a new routing decision (e.g., when the agent's task changes), generate a new affinity ID.
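As a minimal sketch of that rotation (the helper below is hypothetical, not a Plano API):

.. code-block:: python

   import uuid

   def new_affinity_id() -> str:
       # A fresh ID means the next request routes normally again
       # and pins a new decision for the new task.
       return str(uuid.uuid4())

   task_affinity = new_affinity_id()  # reused for every call in this task
   # ... the agent's task changes ...
   task_affinity = new_affinity_id()  # fresh routing for the new task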
Combining Routing Methods
-------------------------


@@ -174,6 +174,11 @@ overrides:
# Model used for agent orchestration (must be listed in model_providers)
agent_orchestration_model: Plano-Orchestrator

# Model affinity — pin routing decisions for agentic loops
routing:
  session_ttl_seconds: 600     # How long a pinned session lasts (default: 600s / 10 min)
  session_max_entries: 10000   # Max cached sessions before eviction (upper limit: 10000)

# State storage for multi-turn conversation history
state_storage:
  type: memory  # "memory" (in-process) or "postgres" (persistent)


@@ -215,6 +215,9 @@ ratelimits:
  selector:
    key: x-org-id
    value: acme-corp
routing:
  session_max_entries: 10000
  session_ttl_seconds: 600
state_storage:
  type: memory
system_prompt: 'You are a helpful assistant. Always respond concisely and accurately.