
Plano Routing API — Request & Response Format

Overview

Plano intercepts LLM requests and routes them to the best available model based on semantic intent and live cost/latency data. The developer sends a standard OpenAI-compatible request with an optional routing_preferences field. Plano returns an ordered list of candidate models; the client uses the first and falls back to the next on 429 or 5xx errors.


Request Format

Standard OpenAI chat completion body. The only addition is the optional routing_preferences field, which is stripped before the request is forwarded upstream.

POST /v1/chat/completions
{
  "model": "openai/gpt-4o-mini",
  "messages": [
    {"role": "user", "content": "write a sorting algorithm in Python"}
  ],
  "routing_preferences": [
    {
      "name": "code generation",
      "description": "generating new code snippets",
      "models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o", "openai/gpt-4o-mini"]
    },
    {
      "name": "general questions",
      "description": "casual conversation and simple queries",
      "models": ["openai/gpt-4o-mini"]
    }
  ]
}

routing_preferences fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Route identifier. Must match the LLM router's route classification. |
| `description` | string | yes | Natural-language description used by the router to match user intent. |
| `models` | string[] | yes | Ordered candidate pool. At least one entry required. Every model must be declared in `model_providers`. |

Notes

  • routing_preferences is optional. If omitted, the config-defined preferences are used.
  • If provided in the request body, it overrides the config for that single request only.
  • model is still required and is used as the fallback if no route is matched.
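The override behavior above can be sketched as a small request builder. This is illustrative only, assuming the endpoint and field names shown in this document; `build_request` is not part of any Plano SDK.

```python
def build_request(messages, routing_preferences=None):
    """Build an OpenAI-compatible body. When routing_preferences is None,
    Plano falls back to the config-defined preferences."""
    body = {
        "model": "openai/gpt-4o-mini",  # still required; fallback if no route matches
        "messages": messages,
    }
    if routing_preferences is not None:
        # Overrides the config-defined preferences for this single request.
        body["routing_preferences"] = routing_preferences
    return body

body = build_request(
    [{"role": "user", "content": "write a sorting algorithm in Python"}],
    routing_preferences=[
        {
            "name": "code generation",
            "description": "generating new code snippets",
            "models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"],
        }
    ],
)
# POST this body to /v1/chat/completions; Plano strips routing_preferences
# before forwarding the request upstream.
```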

Response Format

{
  "models": [
    "anthropic/claude-sonnet-4-20250514",
    "openai/gpt-4o",
    "openai/gpt-4o-mini"
  ],
  "route": "code generation",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}

Fields

| Field | Type | Description |
|---|---|---|
| `models` | string[] | Ranked model list. Use `models[0]` as primary; retry with `models[1]` on 429/5xx, and so on. |
| `route` | string \| null | Name of the matched routing preference, or `null` when no route matched (the request's `model` is then used as the fallback). |
| `trace_id` | string | Trace ID for distributed tracing and observability. |
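A client can unpack the decision like this. A minimal sketch, assuming the response shape above; `pick_primary` is an illustrative helper, not a Plano API.

```python
def pick_primary(decision):
    """Split a routing decision into (primary, fallbacks, route).
    route is None when no preference matched, in which case the
    caller should fall back to the request's own `model` field."""
    models = decision["models"]
    return models[0], models[1:], decision.get("route")

decision = {
    "models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"],
    "route": "code generation",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}
primary, fallbacks, route = pick_primary(decision)
```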

Client Usage Pattern

response = plano.routing_decision(request)
models = response["models"]

for model in models:
    try:
        result = call_llm(model, messages)
        break  # success — stop trying
    except (RateLimitError, ServerError):
        continue  # try next model in the ranked list
else:
    raise RuntimeError("all candidate models failed")
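The pattern above can be packaged as a self-contained helper. A sketch under stated assumptions: the exception classes and `call_with_fallback` are illustrative stand-ins, not names from a Plano client library.

```python
class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the upstream provider."""

class ServerError(Exception):
    """Stand-in for an HTTP 5xx from the upstream provider."""

def call_with_fallback(models, call):
    """Try each candidate in ranked order; retryable errors advance to
    the next model, and the error is re-raised only once the whole pool
    is exhausted. `call` is any callable that invokes a single model."""
    last_exc = None
    for model in models:
        try:
            return call(model)
        except (RateLimitError, ServerError) as exc:
            last_exc = exc  # retryable: move to the next candidate
    raise RuntimeError("all candidate models failed") from last_exc
```

Only 429/5xx-style errors trigger fallback; other exceptions (bad request, auth failure) propagate immediately, since retrying them against another model would fail the same way.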

Configuration (set by platform/ops team)

Requires version: v0.4.0 or above. Models listed under routing_preferences must be declared in model_providers.

version: v0.4.0

model_providers:
  - model: anthropic/claude-sonnet-4-20250514
    access_key: $ANTHROPIC_API_KEY
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true

routing_preferences:
  - name: code generation
    description: generating new code snippets or boilerplate
    models:
      - anthropic/claude-sonnet-4-20250514
      - openai/gpt-4o

  - name: general questions
    description: casual conversation and simple queries
    models:
      - openai/gpt-4o-mini
      - openai/gpt-4o

Model Affinity

In agentic loops where the same session makes multiple LLM calls, send an X-Model-Affinity header to pin the routing decision. The first request routes normally and caches the result. All subsequent requests with the same affinity ID return the cached model without re-running routing.

POST /v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

{
  "model": "openai/gpt-4o-mini",
  "messages": [...]
}
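In an agentic loop, the client generates one affinity ID per session and attaches it to every call. A minimal client-side sketch; the header name comes from this document, while `session_headers` and the loop shape are illustrative assumptions.

```python
import uuid

def session_headers(affinity_id=None):
    """Mint one affinity ID per session (or reuse a supplied one) and
    return it with the header dict to attach to every request. Only the
    first request with this ID runs routing; later ones hit the cache."""
    affinity_id = affinity_id or str(uuid.uuid4())
    return affinity_id, {"X-Model-Affinity": affinity_id}

affinity_id, headers = session_headers()
# for each step in the agent loop:
#     send_chat(body, headers=headers)  # same headers on every call
```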

The routing decision endpoint also supports model affinity:

POST /routing/v1/chat/completions
X-Model-Affinity: a1b2c3d4-5678-...

Response when pinned:

{
  "models": ["anthropic/claude-sonnet-4-20250514"],
  "route": "code generation",
  "trace_id": "...",
  "session_id": "a1b2c3d4-5678-...",
  "pinned": true
}

Without the header, routing runs fresh every time (no breaking change).

Configure TTL and cache size:

routing:
  session_ttl_seconds: 600    # default: 10 min
  session_max_entries: 10000  # upper limit
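The pinning semantics implied by these two settings can be modeled as a TTL cache. This is a behavioral sketch only, assuming expiry after `ttl` seconds and oldest-first eviction at the entry cap; it is not Plano's internal implementation.

```python
import time

class AffinityCache:
    """Toy model of the session pin table: entries expire after ttl
    seconds, and the table is capped at max_entries (evicting the
    oldest pin). clock is injectable for testing."""

    def __init__(self, ttl=600, max_entries=10000, clock=time.monotonic):
        self.ttl, self.max_entries, self.clock = ttl, max_entries, clock
        self._pins = {}  # affinity_id -> (model, pinned_at)

    def get(self, affinity_id):
        entry = self._pins.get(affinity_id)
        if entry is None:
            return None
        model, pinned_at = entry
        if self.clock() - pinned_at > self.ttl:
            del self._pins[affinity_id]  # expired: route fresh next time
            return None
        return model

    def pin(self, affinity_id, model):
        if len(self._pins) >= self.max_entries:
            oldest = min(self._pins, key=lambda k: self._pins[k][1])
            del self._pins[oldest]
        self._pins[affinity_id] = (model, self.clock())
```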

Version Requirements

| Version | Top-level `routing_preferences` |
|---|---|
| < v0.4.0 | Not allowed — startup error if present |
| v0.4.0+ | Supported (required for model routing) |