Model Affinity Demo

Consistent model selection for agentic loops using X-Model-Affinity.

Why Model Affinity?

When an agent runs in a loop — calling tools, reasoning about results, calling more tools — each LLM request hits Plano's router independently. Because prompts vary in intent (tool selection looks like code generation, reasoning about results looks like complex analysis), the router may select different models for each turn, fragmenting context mid-session.

Model affinity solves this: send an X-Model-Affinity header and the first request runs routing as usual, caching the decision. Every subsequent request with the same affinity ID returns the same model, without re-running the router.

Without affinity                         With affinity (X-Model-Affinity)
────────────────                         ───────────────────────────────
Turn 1 → claude-sonnet  (tool calls)     Turn 1 → claude-sonnet  ← routed
Turn 2 → gpt-4o         (reasoning)      Turn 2 → claude-sonnet  ← pinned ✓
Turn 3 → claude-sonnet  (tool calls)     Turn 3 → claude-sonnet  ← pinned ✓
Turn 4 → gpt-4o         (reasoning)      Turn 4 → claude-sonnet  ← pinned ✓
Turn 5 → claude-sonnet  (final answer)   Turn 5 → claude-sonnet  ← pinned ✓
       ↑ model switches every turn                ↑ one model, start to finish

Quick Start

# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/model_affinity
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh          # or: uv run demo.py

What the Demo Does

A database selection agent investigates whether to use PostgreSQL or MongoDB for an e-commerce platform. It runs a real tool-calling loop: the LLM decides which tools to call, receives simulated results, and continues until it has enough data to recommend a database.

Available tools:

  • get_db_benchmarks — fetch performance data for a workload type
  • get_case_studies — retrieve real-world e-commerce case studies
  • check_feature_support — check if a database supports a specific feature

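The exact definitions live in demo.py; as a sketch, one such tool might be declared in the OpenAI function-calling format like this (the parameter names here are illustrative, not the demo's actual schema):

# Hypothetical declaration of one demo tool in OpenAI function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_db_benchmarks",
            "description": "Fetch performance data for a workload type.",
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {"type": "string", "description": "e.g. postgresql or mongodb"},
                    "workload": {"type": "string", "description": "Workload type to benchmark."},
                },
                "required": ["database", "workload"],
            },
        },
    },
]
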
The demo runs the same agent loop twice:

  1. Without affinity — no X-Model-Affinity; models may switch between turns
  2. With affinity — X-Model-Affinity header included; model is pinned from turn 1

Each turn is a separate POST /v1/chat/completions request to Plano using the OpenAI SDK. The demo prints the model used on each turn so you can see the difference.
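
Stripped of its output formatting, each run reduces to a loop like the one below — a sketch, where execute_tool and the fixed turn cap stand in for demo.py's real logic:

import json
import uuid
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

affinity_id = str(uuid.uuid4())  # Run 1 simply omits the header
messages = [{"role": "user", "content": "PostgreSQL or MongoDB for e-commerce?"}]

for turn in range(1, 10):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,  # the declarations sketched above
        extra_headers={"X-Model-Affinity": affinity_id},  # same ID every turn
    )
    msg = response.choices[0].message
    print(f"turn {turn}  [{response.model}]")  # the model Plano actually served

    if not msg.tool_calls:
        break  # no tool calls means the final answer has arrived

    # Feed each simulated tool result back into the conversation.
    messages.append(msg)
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(execute_tool(call)),  # hypothetical helper
        })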

Expected Output

  Run 1: WITHOUT Model Affinity
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    turn 1  [claude-sonnet-4-20250514     ]  get_db_benchmarks, get_db_benchmarks
    turn 2  [gpt-4o                       ]  get_case_studies, get_case_studies     ← switched
    turn 3  [claude-sonnet-4-20250514     ]  check_feature_support                 ← switched
    turn 4  [gpt-4o                       ]  final answer                          ← switched

  ✗  Without affinity: model switched 3 time(s)


  Run 2: WITH Model Affinity  (X-Model-Affinity: a1b2c3d4…)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    turn 1  [claude-sonnet-4-20250514     ]  get_db_benchmarks, get_db_benchmarks
    turn 2  [claude-sonnet-4-20250514     ]  get_case_studies, get_case_studies
    turn 3  [claude-sonnet-4-20250514     ]  check_feature_support
    turn 4  [claude-sonnet-4-20250514     ]  final answer

  ✓  With affinity: claude-sonnet-4-20250514 for all 4 turns

How It Works

Model affinity is implemented in brightstaff. When X-Model-Affinity is present:

  1. First request — routing runs normally, result is cached keyed by the affinity ID
  2. Subsequent requests — cache hit skips routing and returns the cached model instantly
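
brightstaff's internals aren't reproduced here, but conceptually the lookup reduces to a TTL cache keyed by the affinity ID — a minimal sketch, with all names below hypothetical:

import time

# affinity_id -> (model, expiry timestamp)
_affinity_cache: dict[str, tuple[str, float]] = {}

def resolve_model(affinity_id: str | None, request) -> str:
    if affinity_id is not None:
        entry = _affinity_cache.get(affinity_id)
        if entry and entry[1] > time.time():
            return entry[0]  # cache hit: skip routing entirely
    model = run_router(request)  # hypothetical: the normal routing decision
    if affinity_id is not None:
        _affinity_cache[affinity_id] = (model, time.time() + 600)  # session_ttl_seconds
    return model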

The X-Model-Affinity header is forwarded transparently; no changes to your OpenAI SDK calls are needed beyond adding the header:

from openai import OpenAI
import uuid

client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

# One affinity ID per agent session.
affinity_id = str(uuid.uuid4())
prompt = "Should we use PostgreSQL or MongoDB for an e-commerce platform?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Model-Affinity": affinity_id},
)
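
Reuse the same affinity_id on every request in a session; sending a new UUID (or letting the TTL expire) triggers a fresh routing decision.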

Configuration

Model affinity is configurable in config.yaml:

routing:
  session_ttl_seconds: 600      # How long affinity lasts (default: 10 min)
  session_max_entries: 10000    # Max cached sessions (upper limit: 10000)

Without the X-Model-Affinity header, routing runs fresh every time — no breaking change to existing clients.


Advanced: Agent Server Demo

The agent.py file is a FastAPI-based agent server that demonstrates a more complex pattern: an external agent service that forwards X-Model-Affinity on all outbound calls to Plano. Use start_agents.sh to run it.
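
The real server lives in agent.py; the forwarding pattern it demonstrates looks roughly like the sketch below, where the endpoint path and request fields are illustrative rather than agent.py's actual API:

import uuid
from fastapi import FastAPI, Request
from openai import OpenAI

app = FastAPI()
client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

@app.post("/agent/run")  # hypothetical endpoint
async def run_agent(request: Request):
    body = await request.json()
    # Reuse the caller's affinity ID if supplied, otherwise mint one,
    # and forward it on every outbound call to Plano.
    affinity_id = request.headers.get("x-model-affinity") or str(uuid.uuid4())
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": body["prompt"]}],
        extra_headers={"X-Model-Affinity": affinity_id},
    )
    return {"model": response.model, "answer": response.choices[0].message.content}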

See Also