# Plano Routing API — Request & Response Format

## Overview
Plano intercepts LLM requests and routes them to the best available model based on semantic intent and live cost/latency data. The developer sends a standard OpenAI-compatible request with an optional routing_preferences field. Plano returns an ordered list of candidate models; the client uses the first and falls back to the next on 429 or 5xx errors.
## Request Format
Standard OpenAI chat completion body. The only addition is the optional routing_preferences field, which is stripped before the request is forwarded upstream.
POST /v1/chat/completions
```json
{
  "model": "openai/gpt-4o-mini",
  "messages": [
    {"role": "user", "content": "write a sorting algorithm in Python"}
  ],
  "routing_preferences": [
    {
      "name": "code generation",
      "description": "generating new code snippets",
      "models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o", "openai/gpt-4o-mini"]
    },
    {
      "name": "general questions",
      "description": "casual conversation and simple queries",
      "models": ["openai/gpt-4o-mini"]
    }
  ]
}
```
### `routing_preferences` fields
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Route identifier. Must match the LLM router's route classification. |
| `description` | string | yes | Natural language description used by the router to match user intent. |
| `models` | string[] | yes | Ordered candidate pool. At least one entry required. Each model must be declared in `model_providers`. |
### Notes

- `routing_preferences` is optional. If omitted, the config-defined preferences are used.
- If provided in the request body, it overrides the config for that single request only.
- `model` is still required and is used as the fallback if no route is matched.
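As a sketch, a per-request override can be assembled like this. The endpoint and field names come from this doc; the `build_routing_request` helper and its defaults are illustrative, not part of the Plano client.

```python
import json

def build_routing_request(prompt: str, overrides=None) -> dict:
    """Build an OpenAI-compatible chat body with an optional
    per-request routing_preferences override (illustrative helper)."""
    body = {
        "model": "openai/gpt-4o-mini",  # fallback if no route matches
        "messages": [{"role": "user", "content": prompt}],
    }
    if overrides:
        # Overrides the config-defined preferences for this request only;
        # Plano strips this field before forwarding upstream.
        body["routing_preferences"] = overrides
    return body

body = build_routing_request(
    "write a sorting algorithm in Python",
    overrides=[{
        "name": "code generation",
        "description": "generating new code snippets",
        "models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"],
    }],
)
print(json.dumps(body, indent=2))
```

POST the serialized body to `/v1/chat/completions` as usual; requests without the override fall back to the config-defined preferences.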
## Response Format
```json
{
  "models": [
    "anthropic/claude-sonnet-4-20250514",
    "openai/gpt-4o",
    "openai/gpt-4o-mini"
  ],
  "route": "code generation",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```
### Fields

| Field | Type | Description |
|---|---|---|
| `models` | string[] | Ranked model list. Use `models[0]` as primary; retry with `models[1]` on 429/5xx, and so on. |
| `route` | string \| null | Name of the matched routing preference, or `null` if no route matched. |
| `trace_id` | string | Trace ID for distributed tracing and observability. |
## Client Usage Pattern
```python
response = plano.routing_decision(request)
models = response["models"]
for model in models:
    try:
        result = call_llm(model, messages)
        break  # success — stop trying
    except (RateLimitError, ServerError):
        continue  # try next model in the ranked list
```
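The loop above is pseudocode (`plano`, `call_llm`, and the exception types are not defined). A self-contained version of the same failover policy, with stand-in exception classes for HTTP 429/5xx, might look like:

```python
class RateLimitError(Exception):  # stand-in for an HTTP 429
    pass

class ServerError(Exception):  # stand-in for an HTTP 5xx
    pass

def first_healthy(models, call_llm):
    """Try each ranked model in order; return (model, result) for the
    first call that succeeds, re-raising the last error if all fail."""
    last_err = None
    for model in models:
        try:
            return model, call_llm(model)
        except (RateLimitError, ServerError) as err:
            last_err = err  # fall through to the next candidate
    raise last_err

# Simulate the primary being rate-limited so the client falls back.
def fake_call(model):
    if model == "anthropic/claude-sonnet-4-20250514":
        raise RateLimitError("429 from primary")
    return f"ok from {model}"

model, result = first_healthy(
    ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"], fake_call)
print(model)  # → openai/gpt-4o
```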
## Configuration (set by platform/ops team)
Requires version: v0.4.0 or above. Models listed under routing_preferences must be declared in model_providers.
```yaml
version: v0.4.0

model_providers:
  - model: anthropic/claude-sonnet-4-20250514
    access_key: $ANTHROPIC_API_KEY
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true

routing_preferences:
  - name: code generation
    description: generating new code snippets or boilerplate
    models:
      - anthropic/claude-sonnet-4-20250514
      - openai/gpt-4o
  - name: general questions
    description: casual conversation and simple queries
    models:
      - openai/gpt-4o-mini
      - openai/gpt-4o
```
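Since every model referenced under `routing_preferences` must be declared in `model_providers`, a pre-flight check can be sketched as follows (the function name is illustrative; this is not Plano's own validator):

```python
def validate_routing_config(config: dict) -> list:
    """Return the models referenced by routing_preferences that are
    missing from model_providers (an empty list means valid)."""
    declared = {p["model"] for p in config.get("model_providers", [])}
    missing = []
    for pref in config.get("routing_preferences", []):
        for model in pref.get("models", []):
            if model not in declared:
                missing.append(model)
    return missing

config = {
    "model_providers": [{"model": "openai/gpt-4o-mini"}],
    "routing_preferences": [
        {"name": "general questions",
         "models": ["openai/gpt-4o-mini", "openai/gpt-4o"]},
    ],
}
print(validate_routing_config(config))  # → ['openai/gpt-4o']
```

Running a check like this before deploy surfaces the same mismatch that would otherwise fail at startup.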
## Model Affinity
In agentic loops where the same session makes multiple LLM calls, send an X-Model-Affinity header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.
```
POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
```
The routing decision endpoint also supports model affinity:
```
POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
```
Response when pinned:
```json
{
  "models": ["anthropic/claude-sonnet-4-20250514"],
  "route": "code generation",
  "trace_id": "...",
  "session_id": "a1b2c3d4-5678-...",
  "pinned": true
}
```
Without the header, routing runs fresh every time (no breaking change).
Configure TTL and cache size:
```yaml
routing:
  session_ttl_seconds: 600     # default: 10 min
  session_max_entries: 10000   # upper limit
```
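The pinning semantics above can be sketched as a TTL- and size-bounded cache keyed by affinity ID. This is a minimal illustration of the documented behavior, not Plano's implementation; the constructor arguments mirror the `session_ttl_seconds` and `session_max_entries` config keys.

```python
import time
from collections import OrderedDict

class AffinityCache:
    """Minimal TTL + size-bounded cache of pinned routing decisions."""
    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # affinity_id -> (expiry, model)

    def get(self, affinity_id):
        entry = self._entries.get(affinity_id)
        if entry is None or entry[0] < time.monotonic():
            self._entries.pop(affinity_id, None)  # expired or absent
            return None
        return entry[1]

    def put(self, affinity_id, model):
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict oldest session
        self._entries[affinity_id] = (time.monotonic() + self.ttl, model)

cache = AffinityCache(ttl_seconds=600)
cache.put("a1b2c3d4", "anthropic/claude-sonnet-4-20250514")
print(cache.get("a1b2c3d4"))  # → anthropic/claude-sonnet-4-20250514
```

On a cache hit the server can skip routing entirely and return the pinned model with `"pinned": true`; a miss (expired TTL or evicted entry) simply routes fresh, matching the no-header behavior.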
## Version Requirements
| Version | Top-level `routing_preferences` |
|---|---|
| < v0.4.0 | Not allowed — startup error if present |
| v0.4.0+ | Supported (required for model routing) |