# Model Affinity Demo

> Consistent model selection for agentic loops using `X-Model-Affinity`.

## Why Model Affinity?

When an agent runs in a loop — calling tools, reasoning about results, calling more tools — each LLM request hits Plano's router independently. Because prompts vary in intent (tool selection looks like code generation, reasoning about results looks like complex analysis), the router may select **different models** for each turn, fragmenting context mid-session.

**Model affinity** solves this: send an `X-Model-Affinity` header and the first request runs routing as usual, caching the decision. Every subsequent request with the same affinity ID returns the **same model**, without re-running the router.

```
Without affinity                         With affinity (X-Model-Affinity)
────────────────                         ───────────────────────────────
Turn 1 → claude-sonnet  (tool calls)     Turn 1 → claude-sonnet  ← routed
Turn 2 → gpt-4o         (reasoning)      Turn 2 → claude-sonnet  ← pinned ✓
Turn 3 → claude-sonnet  (tool calls)     Turn 3 → claude-sonnet  ← pinned ✓
Turn 4 → gpt-4o         (reasoning)      Turn 4 → claude-sonnet  ← pinned ✓
Turn 5 → claude-sonnet  (final answer)   Turn 5 → claude-sonnet  ← pinned ✓
       ↑ model switches every turn                ↑ one model, start to finish
```

---

## Quick Start

```bash
# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/model_affinity
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh          # or: uv run demo.py
```

---

## What the Demo Does

A **database selection agent** investigates whether to use PostgreSQL or MongoDB
for an e-commerce platform. It runs a real tool-calling loop: the LLM decides
which tools to call, receives simulated results, and continues until it has
enough data to recommend a database.

Available tools:
- `get_db_benchmarks` — fetch performance data for a workload type
- `get_case_studies` — retrieve real-world e-commerce case studies
- `check_feature_support` — check if a database supports a specific feature

The demo runs the **same agent loop twice**:

1. **Without affinity** — no `X-Model-Affinity`; models may switch between turns
2. **With affinity** — `X-Model-Affinity` header included; model is pinned from turn 1

Each turn is a separate `POST /v1/chat/completions` request to Plano using the
[OpenAI SDK](https://github.com/openai/openai-python). The demo prints the
model used on each turn so you can see the difference.

### Expected Output

```
  Run 1: WITHOUT Model Affinity
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    turn 1  [claude-sonnet-4-20250514     ]  get_db_benchmarks, get_db_benchmarks
    turn 2  [gpt-4o                       ]  get_case_studies, get_case_studies     ← switched
    turn 3  [claude-sonnet-4-20250514     ]  check_feature_support                 ← switched
    turn 4  [gpt-4o                       ]  final answer                          ← switched

  ✗  Without affinity: model switched 3 time(s)


  Run 2: WITH Model Affinity  (X-Model-Affinity: a1b2c3d4…)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    turn 1  [claude-sonnet-4-20250514     ]  get_db_benchmarks, get_db_benchmarks
    turn 2  [claude-sonnet-4-20250514     ]  get_case_studies, get_case_studies
    turn 3  [claude-sonnet-4-20250514     ]  check_feature_support
    turn 4  [claude-sonnet-4-20250514     ]  final answer

  ✓  With affinity: claude-sonnet-4-20250514 for all 4 turns
```
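The switch count in the summary line can be derived from the per-turn model names alone. The helper below is a minimal sketch of that bookkeeping; `count_switches` is a hypothetical name and may not match how `demo.py` computes it.

```python
from typing import List


def count_switches(models: List[str]) -> int:
    """Count how many times the model changed between consecutive turns."""
    return sum(1 for prev, cur in zip(models, models[1:]) if prev != cur)


# Model names taken from the sample output above.
run_without_affinity = [
    "claude-sonnet-4-20250514",
    "gpt-4o",
    "claude-sonnet-4-20250514",
    "gpt-4o",
]
run_with_affinity = ["claude-sonnet-4-20250514"] * 4

print(count_switches(run_without_affinity))  # 3
print(count_switches(run_with_affinity))     # 0
```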

### How It Works

Model affinity is implemented in brightstaff. When `X-Model-Affinity` is present:

1. **First request** — routing runs normally, result is cached keyed by the affinity ID
2. **Subsequent requests** — cache hit skips routing and returns the cached model instantly

The `X-Model-Affinity` header is forwarded transparently, so your OpenAI SDK
calls need no changes beyond adding the header.

```python
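import uuid

from openai import OpenAI

# Point the OpenAI SDK at the local Plano listener.
client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

# Generate one ID per agent session and reuse it on every turn.
affinity_id = str(uuid.uuid4())
prompt = "Should we use PostgreSQL or MongoDB for an e-commerce platform?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Model-Affinity": affinity_id},
)
print(response.model)  # model name reported in the response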
```
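In an agent loop the important detail is that the affinity ID is generated once per session and reused on every turn. The following is a rough sketch of that pattern, not the code in `demo.py`: `run_session` and its `use_affinity` flag are hypothetical names, and `client` is assumed to be an `OpenAI` client configured as above.

```python
import uuid
from typing import Any, Dict, List


def run_session(client: Any, prompts: List[str], use_affinity: bool = True) -> List[str]:
    """Send each prompt as its own request and return the model reported per turn."""
    headers: Dict[str, str] = {}
    if use_affinity:
        # One ID for the whole session; every turn sends the same value.
        headers["X-Model-Affinity"] = str(uuid.uuid4())

    models: List[str] = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            extra_headers=headers,
        )
        models.append(response.model)
    return models


# Example usage (requires a running Plano instance):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")
#   print(run_session(client, ["Compare read latency.", "Summarize the trade-offs."]))
```

Generating a new ID inside the loop would defeat the purpose, since each request would then look like the first request of a new session.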

---

## Configuration

Model affinity is configurable in `config.yaml`:

```yaml
routing:
  session_ttl_seconds: 600      # How long affinity lasts (default: 10 min)
  session_max_entries: 10000    # Max cached sessions (upper limit: 10000)
```

Without the `X-Model-Affinity` header, routing runs fresh every time — no breaking
change to existing clients.
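To make the interaction between the two settings and the header concrete, here is a toy in-memory model of the behavior described above. It is only a sketch and not brightstaff's implementation: the class name `AffinityCache`, the fixed (non-sliding) TTL, and the evict-oldest policy when the cache is full are all assumptions made for illustration.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class AffinityCache:
    """Toy model of affinity caching with a TTL and a size cap."""

    def __init__(self, ttl_seconds: float = 600, max_entries: int = 10000,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        self._entries = OrderedDict()  # affinity_id -> (model, expires_at)

    def resolve(self, affinity_id: Optional[str], route: Callable[[], str]) -> str:
        # No header: routing runs fresh and nothing is cached.
        if affinity_id is None:
            return route()

        now = self._clock()
        entry = self._entries.get(affinity_id)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: routing is skipped

        # First request, or the entry expired: route and cache the decision.
        model = route()
        self._entries.pop(affinity_id, None)
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # assumption: evict the oldest entry
        self._entries[affinity_id] = (model, now + self.ttl_seconds)
        return model
```

Under these assumptions, a session that outlives `session_ttl_seconds` is routed again on its next request and may land on a different model.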

---

## Advanced: Agent Server Demo

The `agent.py` file is a FastAPI-based agent server that demonstrates a more
complex pattern: an external agent service that forwards `X-Model-Affinity`
on all outbound calls to Plano. Use `start_agents.sh` to run it.
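The forwarding step itself is small and can be sketched without FastAPI. The helper below is an illustration under assumptions, not the code in `agent.py`: `affinity_headers` is a hypothetical name, and it takes any mapping of inbound header names to values. The lookup is case-insensitive because HTTP header names are.

```python
from typing import Dict, Mapping

AFFINITY_HEADER = "X-Model-Affinity"


def affinity_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Return the affinity header from an inbound request, ready to attach outbound."""
    for name, value in incoming.items():
        if name.lower() == AFFINITY_HEADER.lower():
            return {AFFINITY_HEADER: value}
    # No affinity header on the inbound request: forward nothing.
    return {}


inbound = {"content-type": "application/json", "x-model-affinity": "a1b2c3d4"}
print(affinity_headers(inbound))  # {'X-Model-Affinity': 'a1b2c3d4'}
```

The returned dictionary can be passed as `extra_headers` on each OpenAI SDK call the agent makes to Plano, so every outbound request in the session carries the same ID.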

## See Also

- [Model Routing Service Demo](../model_routing_service/) — curl-based examples of the routing endpoint