Plano intercepts LLM requests and routes them to the best available model based on semantic intent and live cost/latency data. The developer sends a standard OpenAI-compatible request with an optional `routing_preferences` field. Plano returns an ordered list of candidate models; the client uses the first and falls back to the next on 429 or 5xx errors.
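The fallback behavior can be sketched client-side. This is a minimal sketch, not Plano's own client: `complete_with_fallback` and the stubbed `send` callable are hypothetical stand-ins for a real HTTP request.

```python
# Client-side fallback over Plano's ordered candidate list: use the first
# model, and move to the next on 429 or 5xx. `send` is a hypothetical
# callable standing in for a real HTTP request; it returns (status, body).

def complete_with_fallback(candidates, send):
    """Try each candidate in order; return (model, body) on first success."""
    last_status = None
    for model in candidates:
        status, body = send(model)
        if status == 429 or 500 <= status < 600:
            last_status = status  # transient failure: try the next candidate
            continue
        return model, body
    raise RuntimeError(f"all candidates failed (last status: {last_status})")

# Stubbed transport: the top-ranked model is rate-limited, the next succeeds.
responses = {
    "openai/gpt-4o-mini": (429, None),
    "anthropic/claude-sonnet-4-20250514": (200, "ok"),
}
model, body = complete_with_fallback(list(responses), lambda m: responses[m])
# model is now "anthropic/claude-sonnet-4-20250514"
```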
---
## Request Format
Standard OpenAI chat completion body. The only addition is the optional `routing_preferences` field, which is stripped before the request is forwarded upstream.
```json
POST /v1/chat/completions
{
  "model": "openai/gpt-4o-mini",
  "messages": [
    {"role": "user", "content": "write a sorting algorithm in Python"}
  ]
}
```

## Model Affinity

In agentic loops where the same session makes multiple LLM calls, send an `X-Model-Affinity` header to pin the routing decision. The first request routes normally and caches the result; all subsequent requests with the same affinity ID return the cached model without re-running routing.
```json
POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
```
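A session loop using the header can be sketched as follows. This is a minimal sketch under assumptions: the endpoint URL and message contents are illustrative, `build_request` is our own helper name, and the HTTP send itself is omitted.

```python
import json
import uuid

PLANO_URL = "http://localhost:8080/v1/chat/completions"  # assumed deployment

def build_request(messages, affinity_id):
    """Assemble (headers, body) for one call in a pinned session."""
    headers = {
        "Content-Type": "application/json",
        "X-Model-Affinity": affinity_id,  # same ID across the whole session
    }
    body = json.dumps({"model": "openai/gpt-4o-mini", "messages": messages})
    return headers, body

# One affinity ID per agentic session: the first call routes and caches,
# later calls with the same ID reuse the cached model.
session_id = str(uuid.uuid4())
for step in ["plan", "write code", "review"]:
    headers, body = build_request(
        [{"role": "user", "content": step}], session_id
    )
    # POST to PLANO_URL with these headers and body using any HTTP client.
```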
The routing decision endpoint also supports model affinity:
```json
POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...
```
Response when pinned:
```json
{
  "models": ["anthropic/claude-sonnet-4-20250514"],
  "route": "code generation",
  "trace_id": "...",
  "session_id": "a1b2c3d4-5678-...",
  "pinned": true
}
```
Without the header, routing runs fresh every time (no breaking change).
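Consuming the decision payload is straightforward; here is a small sketch (the helper name `choose_model` is ours, the field names come from the example response above):

```python
def choose_model(decision: dict) -> tuple[str, bool]:
    """Return (model_to_call, was_pinned) from a routing decision payload.

    The first entry in `models` is the one to use; `pinned` indicates
    whether it came from the affinity cache or a fresh routing pass.
    """
    return decision["models"][0], decision.get("pinned", False)

# The pinned response from the example above:
decision = {
    "models": ["anthropic/claude-sonnet-4-20250514"],
    "route": "code generation",
    "trace_id": "...",
    "session_id": "a1b2c3d4-5678-...",
    "pinned": True,
}
model, pinned = choose_model(decision)
# model == "anthropic/claude-sonnet-4-20250514", pinned == True
```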