With the rapid proliferation of large language models (LLMs) — each optimized for different strengths, styles, or latency/cost profiles — routing has become an essential technique for operationalizing multiple models. Plano provides three distinct routing approaches to meet different use cases: :ref:`Model-based routing <model_based_routing>`, :ref:`Alias-based routing <alias_based_routing>`, and :ref:`Preference-aligned routing <preference_aligned_routing>`. This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet.
Preference-aligned routing uses the `Plano-Orchestrator <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ model to pick the best LLM based on domain, action, and your configured preferences instead of hard-coding a model.
Plano-Orchestrator analyzes each prompt to infer domain and action, then applies your preferences to select a model. This decouples **routing policy** (how to choose) from **model assignment** (what to run), making routing transparent, controllable, and easy to extend as you add or swap models.
Plano-Orchestrator is a **preference-based routing model** specifically designed to address the limitations of traditional LLM routing. It delivers production-ready performance with low latency and high accuracy while solving key routing challenges.
Unlike benchmark-driven approaches, Plano-Orchestrator learns to match queries with human preferences by using domain-action mappings that capture subjective evaluation criteria, ensuring routing decisions align with real-world user needs.
The system supports seamlessly adding new models for routing without requiring retraining or architectural modifications, enabling dynamic adaptation to evolving model landscapes.
**Preference-Encoded Routing**
Provides a practical mechanism to encode user preferences through domain-action mappings, offering transparent and controllable routing decisions that can be customized for specific use cases.
Both domain and action configs are associated with preferred models or model variants. At inference time, Plano-Orchestrator analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies the user-defined routing preferences to select the model best suited to handle the request.
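As a sketch, such preferences might be declared per provider. The field names below (``llm_providers``, ``routing_preferences``) and the model identifiers are illustrative — verify them against the configuration reference for your Plano release:

.. code-block:: yaml

   # Illustrative only — field names and model IDs are assumptions.
   llm_providers:
     - model: openai/gpt-4o
       routing_preferences:
         - name: code generation
           description: generating new functions, scripts, or boilerplate
     - model: anthropic/claude-sonnet-4
       routing_preferences:
         - name: code understanding
           description: explaining, reviewing, or walking through existing code

At inference time, a prompt such as "why does this loop never terminate?" would match the *code understanding* preference and be routed to the second provider.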
By default, Plano uses a hosted Plano-Orchestrator endpoint. To run Plano-Orchestrator locally, you can serve the model yourself using either **Ollama** or **vLLM**.
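For example, a local deployment might look like the following. The exact flags and model tags are assumptions — ``vllm serve`` accepts the Hugging Face checkpoint name directly, while the Ollama tag depends on what is published in the Ollama registry; check the model card for the canonical names:

.. code-block:: shell

   # Option A: vLLM — serve the Hugging Face checkpoint
   # (adjust port and parallelism flags to your hardware)
   vllm serve katanemo/Plano-Orchestrator-30B-A3B --port 8000

   # Option B: Ollama — assumes an Ollama model tag exists;
   # see the model card for the exact name
   ollama run plano-orchestrator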
In agentic loops — where a single user request triggers multiple LLM calls through tool use — Plano's router classifies each turn independently. Because successive prompts differ in intent (tool selection looks like code generation, reasoning about results looks like analysis), the router may select different models mid-session. This causes behavioral inconsistency and invalidates provider-side KV caches, increasing both latency and cost.
**Model affinity** pins the routing decision for the duration of a session. Send an ``X-Model-Affinity`` header with any string identifier (typically a UUID). The first request routes normally and caches the result. All subsequent requests with the same affinity ID skip routing and reuse the cached model.
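A minimal client-side sketch, assuming Plano exposes an OpenAI-compatible HTTP endpoint (the URL, port, and ``"model": "auto"`` placeholder below are hypothetical):

.. code-block:: python

   import json
   import uuid

   def affinity_headers(session_id: str) -> dict:
       """Headers that pin Plano's routing decision to one session."""
       return {
           "Content-Type": "application/json",
           "X-Model-Affinity": session_id,
       }

   # Generate one affinity ID per user session; every turn of the
   # agentic loop reuses it so all calls hit the same pinned model.
   session_id = str(uuid.uuid4())
   headers = affinity_headers(session_id)

   payload = {
       "model": "auto",  # hypothetical placeholder; Plano selects the model
       "messages": [{"role": "user", "content": "Refactor this function"}],
   }
   # requests.post("http://localhost:12000/v1/chat/completions",
   #               headers=headers, data=json.dumps(payload))

Any stable string works as the affinity ID; a UUID simply guarantees uniqueness across concurrent sessions.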
By default, Plano stores session affinity state in an in-process LRU cache. This works well for single-instance deployments, but sessions are not shared across replicas — each instance has its own independent cache.
For deployments with multiple Plano replicas (Kubernetes, Docker Compose with ``scale``, or any load-balanced setup), use Redis as the session cache backend. All replicas connect to the same Redis instance, so an affinity decision made by one replica is honoured by every other replica in the pool.
**In-memory (default)**
No configuration required. Sessions live only for the lifetime of the process and are lost on restart.
.. code-block:: yaml

   routing:
     session_ttl_seconds: 600  # How long affinity lasts (default: 10 min)
Requires a reachable Redis instance. The ``url`` field supports standard Redis URI syntax, including authentication (``redis://:password@host:6379``) and TLS (``rediss://host:6380``). Redis handles TTL expiry natively, so no periodic cleanup is needed.
.. code-block:: yaml

   routing:
     session_ttl_seconds: 600
     session_cache:
       type: redis
       url: redis://localhost:6379
.. note::

   When using Redis in a multi-tenant environment, construct the ``X-Model-Affinity`` header value to include a tenant identifier, for example ``{tenant_id}:{session_id}``. Plano stores each key under the internal namespace ``plano:affinity:{key}``, so tenant-scoped values avoid cross-tenant collisions without any additional configuration.
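The key composition can be sketched as follows. The helper names are illustrative; only the ``plano:affinity:`` prefix comes from Plano's documented namespace:

.. code-block:: python

   def affinity_key(tenant_id: str, session_id: str) -> str:
       """Compose a tenant-scoped X-Model-Affinity header value."""
       return f"{tenant_id}:{session_id}"

   def redis_key(affinity_value: str) -> str:
       """Plano prefixes header values with its internal namespace."""
       return f"plano:affinity:{affinity_value}"

   value = affinity_key("acme", "abc-123")
   # redis_key(value) -> "plano:affinity:acme:abc-123"

Because the tenant ID is part of the key, two tenants that happen to reuse the same session ID still map to distinct Redis entries.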
**Example: Kubernetes multi-replica deployment**
Deploy a Redis instance alongside your Plano pods and point all replicas at it:
.. code-block:: yaml

   routing:
     session_ttl_seconds: 600
     session_cache:
       type: redis
       url: redis://redis.plano.svc.cluster.local:6379
With this configuration, any replica that first receives a request for affinity ID ``abc-123`` caches the routing decision in Redis. Subsequent requests for ``abc-123`` — regardless of which replica they land on — retrieve the same pinned model.
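The lookup each replica performs can be sketched as below. The function names are hypothetical and ``FakeRedis`` is an in-memory stand-in used here so the sketch is self-contained; a real deployment would use a ``redis.Redis`` client pointed at the shared instance:

.. code-block:: python

   import time

   class FakeRedis:
       """Minimal in-memory stand-in for redis.Redis (illustrative only)."""
       def __init__(self):
           self._data = {}
       def get(self, key):
           value, expires = self._data.get(key, (None, 0.0))
           return value if value is not None and time.time() < expires else None
       def set(self, key, value, ex):
           self._data[key] = (value, time.time() + ex)

   def resolve_model(r, affinity_id, route, ttl=600):
       """Return the pinned model for a session, routing only on a cache miss."""
       key = f"plano:affinity:{affinity_id}"
       cached = r.get(key)
       if cached is not None:
           return cached          # pinned: skip routing entirely
       model = route()            # first request: run preference routing
       r.set(key, model, ex=ttl)  # publish the decision to all replicas
       return model

   r = FakeRedis()
   resolve_model(r, "abc-123", lambda: "gpt-4o")       # routes and caches
   resolve_model(r, "abc-123", lambda: "other-model")  # reuses "gpt-4o"

Because the TTL is set on write (Redis ``SET`` with ``EX``), expiry is handled by Redis itself and no cleanup job is needed, matching the behavior described above.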
- **Coding Tasks**: Distinguish between code generation requests ("write a Python function"), debugging needs ("fix this error"), and code optimization ("make this faster"), routing each to appropriately specialized models.
- **Content Processing Workflows**: Classify requests as summarization ("summarize this document"), translation ("translate to Spanish"), or analysis ("what are the key themes"), enabling targeted model selection.
- **Multi-Domain Applications**: Accurately identify whether requests fall into legal, healthcare, technical, or general domains, even when the subject matter isn't explicitly stated in the prompt.
- **Conversational Routing**: Track conversation context to identify when topics shift between domains or when the type of assistance needed changes mid-conversation.
- **💡 Clear Usage Description:** Make your route names and descriptions specific and unambiguous, and minimize overlap between routes. The router performs better when it can clearly distinguish between different types of requests.

  - ❌ Bad:

    .. code-block:: json

       {"name": "math", "description": "anything closely related to mathematics"}

  - ✅ Good:

    .. code-block:: json

       {"name": "math", "description": "solving, explaining math problems, concepts"}
- **💡 Noun Descriptors:** Preference-based routers perform better with noun-centric descriptors, which offer more stable and semantically rich signals for matching.
- **💡 Domain Inclusion:** For the best user experience, always include a domain route. This lets the router fall back to the domain when the action cannot be confidently inferred.
- **Multi-modality**: The model is not trained to process raw image or audio inputs. It can handle textual queries *about* these modalities (e.g., "generate an image of a cat"), but cannot interpret encoded multimedia data directly.
- **Function calling**: Plano-Orchestrator is designed for **semantic preference matching**, not exact intent classification or tool execution. For structured function invocation, use models in the Plano Function Calling collection instead.
- **System prompt dependency**: Plano-Orchestrator routes based solely on the user's conversation history. It does not use or rely on system prompts for routing decisions.