rename session pinning to model affinity with x-model-affinity header

2026-05-08 15:22:43 +02:00 · 2026-04-08 15:23:53 -07:00 · da9792c2dd
commit da9792c2dd
parent 5789694d2f
14 changed files with 468 additions and 371 deletions
--- a/demos/llm_routing/model_affinity/README.md
+++ b/demos/llm_routing/model_affinity/README.md
@@ -0,0 +1,135 @@
+# Model Affinity Demo
+
+> Consistent model selection for agentic loops using `X-Model-Affinity`.
+
+## Why Model Affinity?
+
+When an agent runs in a loop — calling tools, reasoning about results, calling more tools — each LLM request hits Plano's router independently. Because prompts vary in intent (tool selection looks like code generation, reasoning about results looks like complex analysis), the router may select **different models** for each turn, fragmenting context mid-session.
+
+**Model affinity** solves this: send an `X-Model-Affinity` header carrying an ID of your choice. The first request with that ID is routed as usual and the decision is cached. Every subsequent request with the same affinity ID gets the **same model**, without re-running the router.
+
+```
+Without affinity                         With affinity (X-Model-Affinity)
+────────────────                         ───────────────────────────────
+Turn 1 → claude-sonnet  (tool calls)     Turn 1 → claude-sonnet  ← routed
+Turn 2 → gpt-4o         (reasoning)      Turn 2 → claude-sonnet  ← pinned ✓
+Turn 3 → claude-sonnet  (tool calls)     Turn 3 → claude-sonnet  ← pinned ✓
+Turn 4 → gpt-4o         (reasoning)      Turn 4 → claude-sonnet  ← pinned ✓
+Turn 5 → claude-sonnet  (final answer)   Turn 5 → claude-sonnet  ← pinned ✓
+       ↑ model switches every turn                ↑ one model, start to finish
+```
+
+---
+
+## Quick Start
+
+```bash
+# 1. Set API keys
+export OPENAI_API_KEY=<your-key>
+export ANTHROPIC_API_KEY=<your-key>
+
+# 2. Start Plano
+cd demos/llm_routing/model_affinity
+planoai up config.yaml
+
+# 3. Run the demo (uv manages dependencies automatically)
+./demo.sh          # or: uv run demo.py
+```
+
+---
+
+## What the Demo Does
+
+A **database selection agent** investigates whether to use PostgreSQL or MongoDB
+for an e-commerce platform. It runs a real tool-calling loop: the LLM decides
+which tools to call, receives simulated results, and continues until it has
+enough data to recommend a database.
+
+Available tools:
+- `get_db_benchmarks` — fetch performance data for a workload type
+- `get_case_studies` — retrieve real-world e-commerce case studies
+- `check_feature_support` — check if a database supports a specific feature
+
+The demo runs the **same agent loop twice**:
+
+1. **Without affinity** — no `X-Model-Affinity`; models may switch between turns
+2. **With affinity** — `X-Model-Affinity` header included; model is pinned from turn 1
+
+Each turn is a separate `POST /v1/chat/completions` request to Plano using the
+[OpenAI SDK](https://github.com/openai/openai-python). The demo prints the
+model used on each turn so you can see the difference.
+
+### Expected Output
+
+```
+  Run 1: WITHOUT Model Affinity
+  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+    turn 1  [claude-sonnet-4-20250514     ]  get_db_benchmarks, get_db_benchmarks
+    turn 2  [gpt-4o                       ]  get_case_studies, get_case_studies     ← switched
+    turn 3  [claude-sonnet-4-20250514     ]  check_feature_support                 ← switched
+    turn 4  [gpt-4o                       ]  final answer                          ← switched
+
+  ✗  Without affinity: model switched 3 time(s)
+
+
+  Run 2: WITH Model Affinity  (X-Model-Affinity: a1b2c3d4…)
+  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+    turn 1  [claude-sonnet-4-20250514     ]  get_db_benchmarks, get_db_benchmarks
+    turn 2  [claude-sonnet-4-20250514     ]  get_case_studies, get_case_studies
+    turn 3  [claude-sonnet-4-20250514     ]  check_feature_support
+    turn 4  [claude-sonnet-4-20250514     ]  final answer
+
+  ✓  With affinity: claude-sonnet-4-20250514 for all 4 turns
+```
+
+### How It Works
+
+Model affinity is implemented in brightstaff. When `X-Model-Affinity` is present:
+
+1. **First request** — routing runs normally, result is cached keyed by the affinity ID
+2. **Subsequent requests** — cache hit skips routing and returns the cached model instantly
+
+The `X-Model-Affinity` header is forwarded transparently, so your OpenAI
+SDK calls need no changes beyond adding the header.
+
+```python
+from openai import OpenAI
+import uuid
+
+client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")
+# Generate the affinity ID once per session and reuse it on every turn.
+affinity_id = str(uuid.uuid4())
+prompt = "PostgreSQL or MongoDB for an e-commerce platform?"  # example prompt
+response = client.chat.completions.create(
+    model="gpt-4o-mini",
+    messages=[{"role": "user", "content": prompt}],
+    extra_headers={"X-Model-Affinity": affinity_id},
+)
+```
+
+---
+
+## Configuration
+
+Model affinity is configurable in `config.yaml`:
+
+```yaml
+routing:
+  session_ttl_seconds: 600      # How long affinity lasts (default: 10 min)
+  session_max_entries: 10000    # Max cached sessions (upper limit: 10000)
+```
+
+Without the `X-Model-Affinity` header, routing runs fresh on every request, so
+existing clients are unaffected.
+
+---
+
+## Advanced: Agent Server Demo
+
+The `agent.py` file is a FastAPI-based agent server that demonstrates a more
+complex pattern: an external agent service that forwards `X-Model-Affinity`
+on all outbound calls to Plano. Use `start_agents.sh` to run it.
+
+## See Also
+
+- [Model Routing Service Demo](../model_routing_service/) — curl-based examples of the routing endpoint
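
The README above describes the affinity cache only in outline: the first request is routed and the result is cached under the affinity ID, later requests skip routing, and `session_ttl_seconds` and `session_max_entries` bound the cache. A minimal Python sketch of that behaviour follows. It is not the brightstaff implementation. The names `AffinityCache` and `select_model` are hypothetical, and two details are assumptions the README does not state: the TTL is counted from the first write and not refreshed on hits, and the oldest entry is evicted when the cache is full.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class AffinityCache:
    """Hypothetical in-memory map from affinity ID to the routed model name."""

    def __init__(
        self,
        ttl_seconds: float = 600,    # mirrors session_ttl_seconds
        max_entries: int = 10000,    # mirrors session_max_entries
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        # Insertion-ordered, so the first key is always the oldest entry.
        self._entries: OrderedDict = OrderedDict()

    def get(self, affinity_id: str) -> Optional[str]:
        entry = self._entries.get(affinity_id)
        if entry is None:
            return None
        model, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the next request is routed again.
            del self._entries[affinity_id]
            return None
        return model

    def put(self, affinity_id: str, model: str) -> None:
        # Remove any stale copy so re-insertion lands at the newest position.
        self._entries.pop(affinity_id, None)
        # Assumption: evict oldest-first once the entry limit is reached.
        while self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)
        self._entries[affinity_id] = (model, self._clock() + self.ttl_seconds)


def select_model(
    cache: AffinityCache,
    affinity_id: Optional[str],
    route: Callable[[], str],
) -> str:
    """Return the cached model for affinity_id, routing only on a miss."""
    if affinity_id is None:
        # No header: route fresh every time and cache nothing.
        return route()
    cached = cache.get(affinity_id)
    if cached is not None:
        return cached
    model = route()
    cache.put(affinity_id, model)
    return model
```

Here `route` stands in for whatever routing call the proxy would make. Passing the clock as a parameter keeps the expiry logic testable without sleeping.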