# Model Affinity Demo

> Consistent model selection for agentic loops using `X-Model-Affinity`.

## Why Model Affinity?

When an agent runs in a loop (calling tools, reasoning about results, calling more tools), each LLM request hits Plano's router independently. Because prompts vary in intent (tool selection can look like code generation, while reasoning about results can look like complex analysis), the router may select **different models** for different turns, fragmenting context mid-session.

**Model affinity** solves this: send an `X-Model-Affinity` header, and the first request runs routing as usual and caches the decision. Every subsequent request with the same affinity ID is served by the **same model**, without re-running the router.

```
Without affinity                        With affinity (X-Model-Affinity)
────────────────                        ───────────────────────────────
Turn 1 → claude-sonnet (tool calls)     Turn 1 → claude-sonnet  ← routed
Turn 2 → gpt-4o        (reasoning)      Turn 2 → claude-sonnet  ← pinned ✓
Turn 3 → claude-sonnet (tool calls)     Turn 3 → claude-sonnet  ← pinned ✓
Turn 4 → gpt-4o        (reasoning)      Turn 4 → claude-sonnet  ← pinned ✓
Turn 5 → claude-sonnet (final answer)   Turn 5 → claude-sonnet  ← pinned ✓
         ↑ model switches every turn             ↑ one model, start to finish
```

---

## Quick Start

```bash
# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/model_affinity
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh       # or: uv run demo.py
```

---

## What the Demo Does

A **database selection agent** investigates whether to use PostgreSQL or MongoDB
for an e-commerce platform. It runs a real tool-calling loop: the LLM decides
which tools to call, receives simulated results, and continues until it has
enough data to recommend a database.

Available tools:
- `get_db_benchmarks`: fetch performance data for a workload type
- `get_case_studies`: retrieve real-world e-commerce case studies
- `check_feature_support`: check whether a database supports a specific feature

The demo runs the **same agent loop twice**:

1. **Without affinity**: no `X-Model-Affinity` header is sent; models may switch between turns
2. **With affinity**: the `X-Model-Affinity` header is included; the model is pinned from turn 1

Each turn is a separate `POST /v1/chat/completions` request to Plano using the
[OpenAI SDK](https://github.com/openai/openai-python). The demo prints the
model used on each turn so you can see the difference.

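
The per-turn comparison can be reproduced in a few lines. The sketch below is not taken from `demo.py`: `count_switches` is a hypothetical helper, and the model lists are illustrative values standing in for the model names read from each turn's response.

```python
def count_switches(models: list[str]) -> int:
    """Count how many times the model changes between consecutive turns."""
    return sum(1 for prev, cur in zip(models, models[1:]) if prev != cur)


# Illustrative values: the model name reported by each turn's response.
without_affinity = [
    "claude-sonnet-4-20250514",
    "gpt-4o",
    "claude-sonnet-4-20250514",
    "gpt-4o",
]
with_affinity = ["claude-sonnet-4-20250514"] * 4

print(count_switches(without_affinity))  # 3
print(count_switches(with_affinity))     # 0
```

A result of `0` means a single model served the whole session.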
### Expected Output

```
Run 1: WITHOUT Model Affinity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
turn 1 [claude-sonnet-4-20250514 ] get_db_benchmarks, get_db_benchmarks
turn 2 [gpt-4o                   ] get_case_studies, get_case_studies ← switched
turn 3 [claude-sonnet-4-20250514 ] check_feature_support ← switched
turn 4 [gpt-4o                   ] final answer ← switched

✗ Without affinity: model switched 3 time(s)


Run 2: WITH Model Affinity (X-Model-Affinity: a1b2c3d4…)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
turn 1 [claude-sonnet-4-20250514 ] get_db_benchmarks, get_db_benchmarks
turn 2 [claude-sonnet-4-20250514 ] get_case_studies, get_case_studies
turn 3 [claude-sonnet-4-20250514 ] check_feature_support
turn 4 [claude-sonnet-4-20250514 ] final answer

✓ With affinity: claude-sonnet-4-20250514 for all 4 turns
```

### How It Works

Model affinity is implemented in brightstaff. When `X-Model-Affinity` is present:

1. **First request**: routing runs normally, and the result is cached under the affinity ID
2. **Subsequent requests**: a cache hit skips routing and returns the cached model immediately

The `X-Model-Affinity` header is forwarded transparently, so your OpenAI
SDK calls need no changes beyond adding the header.

```python
import uuid

from openai import OpenAI

# Send requests to the local Plano endpoint rather than directly to a provider.
client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

# Generate one ID per agent session and reuse it on every turn.
affinity_id = str(uuid.uuid4())

# Example prompt; replace with your own.
prompt = "Should we use PostgreSQL or MongoDB for an e-commerce platform?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Model-Affinity": affinity_id},
)
print(response.model)  # model name reported in the response
```
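
In an agent loop, the same affinity ID is sent on every turn. The following is a minimal sketch of that pattern, not the demo's actual code: `run_turns` is a hypothetical helper, and `client` is assumed to be an OpenAI SDK client configured as in the snippet above (any object exposing the same `chat.completions.create` method would work). It returns the model name reported for each turn, which should be identical across turns when affinity is in effect.

```python
import uuid


def run_turns(client, prompts, model="gpt-4o-mini"):
    """Send each prompt as a separate request, reusing one affinity ID.

    Returns the model name reported by each response, in order.
    """
    # One ID per agent session; every turn carries the same header value.
    headers = {"X-Model-Affinity": str(uuid.uuid4())}
    messages, models_used = [], []
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            extra_headers=headers,
        )
        models_used.append(response.model)
        # Keep the assistant reply so the next turn sees the full conversation.
        reply = response.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": reply})
    return models_used
```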
---

## Configuration

Model affinity is configurable in `config.yaml`:

```yaml
routing:
  session_ttl_seconds: 600    # How long affinity lasts (default: 10 min)
  session_max_entries: 10000  # Max cached sessions (upper limit: 10000)
```
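
To make the two settings concrete, here is a rough sketch of a session cache with a time-to-live and a size bound. It is an illustration only, not brightstaff's implementation: the `AffinityCache` class, the `select_model` function, the oldest-first eviction policy, and the choice not to refresh the TTL on a cache hit are all assumptions made for this example.

```python
import time
from collections import OrderedDict


class AffinityCache:
    """Illustrative affinity-ID -> model cache with a TTL and a size bound."""

    def __init__(self, ttl_seconds=600, max_entries=10000, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self._entries = OrderedDict()  # affinity_id -> (model, expires_at)

    def get(self, affinity_id):
        """Return the pinned model, or None if absent or expired."""
        entry = self._entries.get(affinity_id)
        if entry is None:
            return None
        model, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[affinity_id]  # TTL elapsed: forget the pin
            return None
        return model

    def put(self, affinity_id, model):
        """Pin a model to an affinity ID, evicting the oldest entry if full."""
        is_new = affinity_id not in self._entries
        if is_new and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # assumption: drop the oldest entry
        self._entries[affinity_id] = (model, self.clock() + self.ttl_seconds)


def select_model(cache, affinity_id, route):
    """Route once per affinity ID; reuse the cached decision afterwards."""
    if affinity_id is None:
        return route()  # no header: routing runs on every request
    model = cache.get(affinity_id)
    if model is None:
        model = route()  # first request, or the earlier pin expired
        cache.put(affinity_id, model)
    return model
```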
Without the `X-Model-Affinity` header, routing runs fresh on every request, so
existing clients are unaffected.

---

## Advanced: Agent Server Demo

The `agent.py` file is a FastAPI-based agent server that demonstrates a more
complex pattern: an external agent service that forwards `X-Model-Affinity`
on all outbound calls to Plano. Use `start_agents.sh` to run it.
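
The forwarding pattern amounts to copying the inbound header onto each outbound request. The sketch below is not the code in `agent.py`: `affinity_headers` is a hypothetical helper that takes the inbound request headers as a plain mapping and returns the headers to attach to calls to Plano.

```python
from typing import Mapping

AFFINITY_HEADER = "X-Model-Affinity"


def affinity_headers(inbound: Mapping[str, str]) -> dict[str, str]:
    """Build outbound headers that carry the caller's affinity ID, if any."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in inbound.items():
        if name.lower() == AFFINITY_HEADER.lower() and value:
            return {AFFINITY_HEADER: value}
    return {}  # no affinity requested: each call is routed independently


print(affinity_headers({"x-model-affinity": "a1b2c3d4", "accept": "*/*"}))
# {'X-Model-Affinity': 'a1b2c3d4'}
print(affinity_headers({"accept": "*/*"}))
# {}
```

The returned dict can be passed as `extra_headers` on each OpenAI SDK call the agent makes.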

## See Also

- [Model Routing Service Demo](../model_routing_service/): curl-based examples of the routing endpoint