mirror of
https://github.com/katanemo/plano.git
synced 2026-04-25 00:36:34 +02:00
add model affinity docs to llm_router guide, config reference, and routing API
This commit is contained in:
parent
da9792c2dd
commit
53602f4788
3 changed files with 86 additions and 0 deletions
@@ -120,6 +120,49 @@ routing_preferences:

---

## Model Affinity

In agentic loops where the same session makes multiple LLM calls, send an `X-Model-Affinity` header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.

```http
POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
```

The routing decision endpoint also supports model affinity:

```http
POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
```

Response when pinned:

```json
{
  "models": ["anthropic/claude-sonnet-4-20250514"],
  "route": "code generation",
  "trace_id": "...",
  "session_id": "a1b2c3d4-5678-...",
  "pinned": true
}
```

Without the header, routing runs fresh every time (no breaking change).
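As a concrete illustration, here is a minimal client sketch for querying the routing decision endpoint with an affinity header, using only the Python standard library. The base URL (`localhost:12000`), the example prompt, and the payload shape are assumptions drawn from the examples in this guide; adjust them to your deployment.

```python
# Hypothetical client sketch for the routing decision endpoint.
# Base URL and payload contents are assumptions; adjust to your deployment.
import json
import urllib.request
import uuid

affinity_id = str(uuid.uuid4())

payload = {
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
}
headers = {
    "Content-Type": "application/json",
    "X-Model-Affinity": affinity_id,
}

req = urllib.request.Request(
    "http://localhost:12000/routing/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers=headers,
)

# Against a running gateway:
# with urllib.request.urlopen(req) as resp:
#     decision = json.load(resp)
#     # decision["pinned"] is False on the first call for this affinity_id
#     # and True on later calls while the session stays cached.
```

Reusing the same `affinity_id` across every request in the loop is what keeps the routing decision pinned.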

Configure TTL and cache size:

```yaml
routing:
  session_ttl_seconds: 600     # default: 10 min
  session_max_entries: 10000   # upper limit
```
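To make the two knobs concrete, here is a hypothetical in-memory sketch of the cache semantics they control. This is an illustration, not Plano's actual implementation: entries expire after the TTL, and the oldest session is evicted once the entry cap is reached.

```python
# Hypothetical sketch of the affinity-cache semantics controlled by
# session_ttl_seconds and session_max_entries. Not Plano's implementation.
import time
from collections import OrderedDict


class AffinityCache:
    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # affinity_id -> (model, expiry)

    def get(self, affinity_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(affinity_id)
        if entry is None:
            return None
        model, expiry = entry
        if now >= expiry:  # TTL elapsed: the pin is gone
            del self._entries[affinity_id]
            return None
        return model

    def put(self, affinity_id, model, now=None):
        now = time.monotonic() if now is None else now
        if affinity_id not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest session
        self._entries[affinity_id] = (model, now + self.ttl)


cache = AffinityCache(ttl_seconds=600, max_entries=2)
cache.put("s1", "openai/gpt-4o-mini", now=0.0)
print(cache.get("s1", now=10.0))   # still pinned within the TTL
print(cache.get("s1", now=601.0))  # expired: routing runs fresh again
```

After expiry or eviction, the next request with that affinity ID simply routes fresh and re-pins, matching the no-breaking-change behavior described above.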

---

## Version Requirements

| Version | Top-level `routing_preferences` |

@@ -376,6 +376,44 @@ For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YA
`demo README <https://github.com/katanemo/plano/tree/main/demos/llm_routing/model_routing_service/README.md>`_.

.. _model_affinity:

Model Affinity
--------------

In agentic loops — where a single user request triggers multiple LLM calls through tool use — Plano's router classifies each turn independently. Because successive prompts differ in intent (tool selection looks like code generation, reasoning about results looks like analysis), the router may select different models mid-session. This causes behavioral inconsistency and invalidates provider-side KV caches, increasing both latency and cost.

**Model affinity** pins the routing decision for the duration of a session. Send an ``X-Model-Affinity`` header with any string identifier (typically a UUID). The first request routes normally and caches the result. All subsequent requests with the same affinity ID skip routing and reuse the cached model.

.. code-block:: python

   import uuid

   from openai import OpenAI

   client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")
   affinity_id = str(uuid.uuid4())

   # Every call in the loop uses the same header
   response = client.chat.completions.create(
       model="gpt-4o-mini",
       messages=messages,
       tools=tools,
       extra_headers={"X-Model-Affinity": affinity_id},
   )

Without the header, routing runs fresh on every request — no behavior change for existing clients.

**Configuration:**

.. code-block:: yaml

   routing:
     session_ttl_seconds: 600      # How long affinity lasts (default: 10 min)
     session_max_entries: 10000    # Max cached sessions (upper limit: 10000)

To start a new routing decision (e.g., when the agent's task changes), generate a new affinity ID.
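
That pattern can be sketched as a small helper (hypothetical; only the ``X-Model-Affinity`` semantics come from this guide):

.. code-block:: python

   import uuid

   _task_sessions: dict[str, str] = {}

   def affinity_for(task_id: str) -> str:
       """Return a stable affinity ID per task; a new task gets a new ID."""
       if task_id not in _task_sessions:
           _task_sessions[task_id] = str(uuid.uuid4())
       return _task_sessions[task_id]

   # Reuse within a task pins the model; switching tasks re-routes.
   headers = {"X-Model-Affinity": affinity_for("refactor-auth-module")}

Calls that share a ``task_id`` stay pinned to one model, while a new task triggers a fresh routing decision.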


Combining Routing Methods
-------------------------

@@ -174,6 +174,11 @@ overrides:
# Model used for agent orchestration (must be listed in model_providers)
agent_orchestration_model: Plano-Orchestrator

# Model affinity — pin routing decisions for agentic loops
routing:
  session_ttl_seconds: 600      # How long a pinned session lasts (default: 600s / 10 min)
  session_max_entries: 10000    # Max cached sessions before eviction (upper limit: 10000)

# State storage for multi-turn conversation history
state_storage:
  type: memory  # "memory" (in-process) or "postgres" (persistent)