diff --git a/docs/routing-api.md b/docs/routing-api.md
index 0b30d627..c2b9c63f 100644
--- a/docs/routing-api.md
+++ b/docs/routing-api.md
@@ -120,6 +120,49 @@ routing_preferences:
 
 ---
 
+## Model Affinity
+
+In agentic loops where the same session makes multiple LLM calls, send an `X-Model-Affinity` header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.
+
+```http
+POST /v1/chat/completions
+X-Model-Affinity: a1b2c3d4-5678-...
+
+{
+  "model": "openai/gpt-4o-mini",
+  "messages": [...]
+}
+```
+
+The routing decision endpoint also supports model affinity:
+
+```http
+POST /routing/v1/chat/completions
+X-Model-Affinity: a1b2c3d4-5678-...
+```
+
+Response when pinned:
+```json
+{
+  "models": ["anthropic/claude-sonnet-4-20250514"],
+  "route": "code generation",
+  "trace_id": "...",
+  "session_id": "a1b2c3d4-5678-...",
+  "pinned": true
+}
+```
+
+Without the header, routing runs fresh every time (no breaking change).
+
+Configure TTL and cache size:
+```yaml
+routing:
+  session_ttl_seconds: 600    # default: 10 min
+  session_max_entries: 10000  # max cached sessions (upper limit: 10000)
+```
+
+---
+
 ## Version Requirements
 
 | Version | Top-level `routing_preferences` |
diff --git a/docs/source/guides/llm_router.rst b/docs/source/guides/llm_router.rst
index 7c4ad685..f294043a 100644
--- a/docs/source/guides/llm_router.rst
+++ b/docs/source/guides/llm_router.rst
@@ -376,6 +376,44 @@ For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YA
 `demo README `_.
 
 
+.. _model_affinity:
+
+Model Affinity
+--------------
+
+In agentic loops, where a single user request triggers multiple LLM calls through tool use, Plano's router classifies each turn independently. Because successive prompts differ in intent (tool selection looks like code generation, reasoning about results looks like analysis), the router may select different models mid-session.
This causes behavioral inconsistency and invalidates provider-side KV caches, increasing both latency and cost.
+
+**Model affinity** pins the routing decision for the duration of a session. Send an ``X-Model-Affinity`` header with any string identifier (typically a UUID). The first request routes normally and caches the result. All subsequent requests with the same affinity ID skip routing and reuse the cached model.
+
+.. code-block:: python
+
+    import uuid
+    from openai import OpenAI
+
+    client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")
+    affinity_id = str(uuid.uuid4())
+
+    # Every call in the loop sends the same header (messages and tools are built by the agent loop)
+    response = client.chat.completions.create(
+        model="gpt-4o-mini",
+        messages=messages,
+        tools=tools,
+        extra_headers={"X-Model-Affinity": affinity_id},
+    )
+
+Without the header, routing runs fresh on every request, so existing clients see no behavior change.
+
+**Configuration:**
+
+.. code-block:: yaml
+
+    routing:
+      session_ttl_seconds: 600     # How long affinity lasts (default: 10 min)
+      session_max_entries: 10000   # Max cached sessions (upper limit: 10000)
+
+To start a new routing decision (e.g., when the agent's task changes), generate a new affinity ID.
+
+
 Combining Routing Methods
 -------------------------
 
diff --git a/docs/source/resources/includes/plano_config_full_reference.yaml b/docs/source/resources/includes/plano_config_full_reference.yaml
index 452bc17a..787b09d3 100644
--- a/docs/source/resources/includes/plano_config_full_reference.yaml
+++ b/docs/source/resources/includes/plano_config_full_reference.yaml
@@ -174,6 +174,11 @@ overrides:
   # Model used for agent orchestration (must be listed in model_providers)
   agent_orchestration_model: Plano-Orchestrator
 
+# Model affinity: pin routing decisions for agentic loops
+routing:
+  session_ttl_seconds: 600     # How long a pinned session lasts (default: 600s / 10 min)
+  session_max_entries: 10000   # Max cached sessions before eviction (upper limit: 10000)
+
 # State storage for multi-turn conversation history
 state_storage:
   type: memory # "memory" (in-process) or "postgres" (persistent)