Model Affinity Demo
Consistent model selection for agentic loops using X-Model-Affinity.
Why Model Affinity?
When an agent runs in a loop — calling tools, reasoning about results, calling more tools — each LLM request hits Plano's router independently. Because prompts vary in intent (tool selection looks like code generation, reasoning about results looks like complex analysis), the router may select different models for each turn, fragmenting context mid-session.
Model affinity solves this: send an X-Model-Affinity header, and the first request is routed as usual and the decision is cached. Every subsequent request with the same affinity ID reuses that model without re-running the router.
Without affinity                          With affinity (X-Model-Affinity)
────────────────                          ───────────────────────────────
Turn 1 → claude-sonnet (tool calls)       Turn 1 → claude-sonnet  ← routed
Turn 2 → gpt-4o (reasoning)               Turn 2 → claude-sonnet  ← pinned ✓
Turn 3 → claude-sonnet (tool calls)       Turn 3 → claude-sonnet  ← pinned ✓
Turn 4 → gpt-4o (reasoning)               Turn 4 → claude-sonnet  ← pinned ✓
Turn 5 → claude-sonnet (final answer)     Turn 5 → claude-sonnet  ← pinned ✓
↑ model switches every turn               ↑ one model, start to finish
Quick Start
# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>
# 2. Start Plano
cd demos/llm_routing/model_affinity
planoai up config.yaml
# 3. Run the demo (uv manages dependencies automatically)
./demo.sh # or: uv run demo.py
What the Demo Does
A database selection agent investigates whether to use PostgreSQL or MongoDB for an e-commerce platform. It runs a real tool-calling loop: the LLM decides which tools to call, receives simulated results, and continues until it has enough data to recommend a database.
Available tools:
- get_db_benchmarks: fetch performance data for a workload type
- get_case_studies: retrieve real-world e-commerce case studies
- check_feature_support: check if a database supports a specific feature
The demo runs the same agent loop twice:
- Without affinity: no X-Model-Affinity header; models may switch between turns
- With affinity: X-Model-Affinity header included; model is pinned from turn 1
Each turn is a separate POST /v1/chat/completions request to Plano using the OpenAI SDK. The demo prints the model used on each turn so you can see the difference.
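As a rough sketch of the per-turn request pattern described above (not the code in demo.py), the loop below sends one request per prompt and records the model reported in each response. The function name `run_turns` and its parameters are hypothetical, and the real demo runs a tool-calling loop with accumulated messages rather than independent prompts.

```python
import uuid


def run_turns(client, prompts, use_affinity):
    """Send one chat completion per prompt; return the model that served each turn.

    `client` is any object exposing the OpenAI SDK's
    `chat.completions.create(...)` method, such as `openai.OpenAI(...)`.
    """
    # Assumption: one affinity ID per run, shared by every turn in that run.
    headers = {"X-Model-Affinity": str(uuid.uuid4())} if use_affinity else {}
    models = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            extra_headers=headers,
        )
        models.append(response.model)  # model name reported in the response
    return models
```

To try it against a running Plano instance, pass an `OpenAI` client configured with Plano's base URL as `client` and call the function once with `use_affinity=False` and once with `use_affinity=True`.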
Expected Output
Run 1: WITHOUT Model Affinity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
turn 1 [claude-sonnet-4-20250514 ] get_db_benchmarks, get_db_benchmarks
turn 2 [gpt-4o ] get_case_studies, get_case_studies ← switched
turn 3 [claude-sonnet-4-20250514 ] check_feature_support ← switched
turn 4 [gpt-4o ] final answer ← switched
✗ Without affinity: model switched 3 time(s)
Run 2: WITH Model Affinity (X-Model-Affinity: a1b2c3d4…)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
turn 1 [claude-sonnet-4-20250514 ] get_db_benchmarks, get_db_benchmarks
turn 2 [claude-sonnet-4-20250514 ] get_case_studies, get_case_studies
turn 3 [claude-sonnet-4-20250514 ] check_feature_support
turn 4 [claude-sonnet-4-20250514 ] final answer
✓ With affinity: claude-sonnet-4-20250514 for all 4 turns
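The "switched N time(s)" summary can be reproduced from the per-turn model names. The helper below is an illustrative sketch, not the demo's own code; the name `count_switches` is hypothetical, and the model lists are copied from the sample output above.

```python
def count_switches(models):
    """Count turns whose model differs from the previous turn's model."""
    return sum(1 for prev, cur in zip(models, models[1:]) if prev != cur)


without_affinity = [
    "claude-sonnet-4-20250514",
    "gpt-4o",
    "claude-sonnet-4-20250514",
    "gpt-4o",
]
with_affinity = ["claude-sonnet-4-20250514"] * 4

print(count_switches(without_affinity))  # 3
print(count_switches(with_affinity))     # 0
```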
How It Works
Model affinity is implemented in brightstaff. When X-Model-Affinity is present:
- First request — routing runs normally, result is cached keyed by the affinity ID
- Subsequent requests — cache hit skips routing and returns the cached model instantly
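The two cases above can be modeled with a small dictionary-backed sketch. This is a toy illustration of the behavior, not brightstaff's implementation; `AffinityRouter`, `select_model`, and `toy_route` are hypothetical names, and the routing rule is invented for the example.

```python
class AffinityRouter:
    """Toy model of affinity pinning: route once per affinity ID, then reuse."""

    def __init__(self, route_fn):
        self.route_fn = route_fn  # stand-in for the real routing decision
        self.pinned = {}          # affinity ID -> model name

    def select_model(self, prompt, affinity_id=None):
        if affinity_id is None:
            # No header: routing runs fresh on every request.
            return self.route_fn(prompt)
        if affinity_id not in self.pinned:
            # First request for this ID: route normally and cache the result.
            self.pinned[affinity_id] = self.route_fn(prompt)
        # Later requests: cache hit, routing is skipped.
        return self.pinned[affinity_id]


def toy_route(prompt):
    # Invented rule for illustration only.
    return "claude-sonnet" if "tool" in prompt else "gpt-4o"


router = AffinityRouter(toy_route)
print(router.select_model("call a tool", "session-1"))           # claude-sonnet
print(router.select_model("reason about results", "session-1"))  # claude-sonnet
print(router.select_model("reason about results"))               # gpt-4o
```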
The X-Model-Affinity header is forwarded transparently; your OpenAI SDK calls need no changes beyond adding the header.
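import uuid

from openai import OpenAI

# Point the SDK at Plano; the API key here is a placeholder.
client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

# Generate one affinity ID per agent session and reuse it on every turn.
affinity_id = str(uuid.uuid4())
prompt = "Should we use PostgreSQL or MongoDB?"  # example prompt

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Model-Affinity": affinity_id},
)
print(response.model)  # model reported for this request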
Configuration
Model affinity is configurable in config.yaml:
routing:
  session_ttl_seconds: 600     # How long affinity lasts (default: 10 min)
  session_max_entries: 10000   # Max cached sessions (upper limit: 10000)
Without the X-Model-Affinity header, routing runs fresh on every request, so existing clients are unaffected.
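To make the two settings concrete, here is a minimal sketch of a cache with a time-to-live and an entry cap. It rests on assumptions the README does not confirm: that the TTL is counted from when the entry is stored (not refreshed on each hit) and that the oldest entry is evicted when the cap is reached. `TtlAffinityCache` and its methods are hypothetical names.

```python
import time
from collections import OrderedDict


class TtlAffinityCache:
    """Toy affinity cache bounded by a TTL and a maximum entry count."""

    def __init__(self, ttl_seconds=600, max_entries=10000, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock            # injectable so expiry can be tested
        self.entries = OrderedDict()  # affinity ID -> (model, expires_at), oldest first

    def get(self, affinity_id):
        entry = self.entries.get(affinity_id)
        if entry is None:
            return None
        model, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop the entry so the next request is routed again.
            del self.entries[affinity_id]
            return None
        return model

    def put(self, affinity_id, model):
        self.entries.pop(affinity_id, None)  # re-inserting makes the ID newest
        if len(self.entries) >= self.max_entries:
            # Assumption: evict the oldest entry when the cache is full.
            self.entries.popitem(last=False)
        self.entries[affinity_id] = (model, self.clock() + self.ttl_seconds)
```

Under these assumptions, a session that stays idle longer than session_ttl_seconds is routed from scratch on its next request, and may land on a different model.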
Advanced: Agent Server Demo
The agent.py file is a FastAPI-based agent server that demonstrates a more
complex pattern: an external agent service that forwards X-Model-Affinity
on all outbound calls to Plano. Use start_agents.sh to run it.
See Also
- Model Routing Service Demo — curl-based examples of the routing endpoint