# Session Pinning Demo
> Consistent model selection for agentic loops using `X-Session-Id`.
## Why Session Pinning?
When an agent runs in a loop — research → analyse → implement → evaluate → summarise — each step hits Plano's router independently. Because prompts vary in intent, the router may select **different models** for each step, fragmenting context mid-session.
**Session pinning** solves this: send an `X-Session-Id` header and the first request runs routing as usual, caching the decision. Every subsequent request with the same session ID returns the **same model**, without re-running the router.
```
Without pinning                       With pinning (X-Session-Id)
─────────────────────────────────     ──────────────────────────────
Step 1 → claude-sonnet (code_gen)     Step 1 → claude-sonnet ← routed
Step 2 → gpt-4o (reasoning)           Step 2 → claude-sonnet ← pinned ✓
Step 3 → claude-sonnet (code_gen)     Step 3 → claude-sonnet ← pinned ✓
Step 4 → gpt-4o (reasoning)           Step 4 → claude-sonnet ← pinned ✓
Step 5 → claude-sonnet (code_gen)     Step 5 → claude-sonnet ← pinned ✓

  ↑ model switches every step           ↑ one model, start to finish
```
---
## Quick Start
```bash
# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/session_pinning
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh   # or: uv run demo.py
```
---
## What the Demo Does
A **Database Research Agent** investigates whether to use PostgreSQL or MongoDB
for an e-commerce platform. It runs 5 steps, each building on prior findings via
accumulated message history. Steps alternate between `code_generation` and
`complex_reasoning` intents so Plano routes to different models without pinning.
| Step | Task | Intent → Model |
|:----:|------|--------|
| 1 | List technical requirements | code_generation → claude-sonnet |
| 2 | Compare PostgreSQL vs MongoDB | complex_reasoning → gpt-4o |
| 3 | Write schema (CREATE TABLE) | code_generation → claude-sonnet |
| 4 | Assess scalability trade-offs | complex_reasoning → gpt-4o |
| 5 | Write final recommendation report | code_generation → claude-sonnet |
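The accumulated-history loop behind the table above can be sketched as follows (prompt wording and the `build_history` helper are illustrative, not the demo's actual code):

```python
# Illustrative step prompts; the demo's actual wording differs.
STEPS = [
    "List the technical requirements for the e-commerce platform.",  # code_generation
    "Compare PostgreSQL vs MongoDB against those requirements.",     # complex_reasoning
    "Write the schema (CREATE TABLE statements).",                   # code_generation
    "Assess the scalability trade-offs of that schema.",             # complex_reasoning
    "Write the final recommendation report.",                        # code_generation
]

def build_history(answers):
    """Fold each step's prompt and its answer into one growing message list,
    so step N sees everything produced in steps 1..N-1."""
    messages = []
    for prompt, answer in zip(STEPS, answers):
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": answer})
    return messages
```

Because every step carries the full history forward, a mid-loop model switch forces the new model to interpret another model's phrasing and conventions cold.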
The demo runs the loop **twice** against `/v1/chat/completions` using the
[OpenAI SDK](https://github.com/openai/openai-python):

1. **Without pinning** — no `X-Session-Id`; models alternate per step
2. **With pinning** — `X-Session-Id` header included; model is pinned from step 1

Each step makes real LLM calls. Step 5's report explicitly references findings
from earlier steps, demonstrating why coherent context requires a consistent model.
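The shape of the two runs can be sketched with a stubbed transport standing in for `client.chat.completions.create` (the stub and its alternation rule are invented for illustration; only the header handling mirrors the demo):

```python
import uuid

def fake_create(messages, extra_headers):
    """Stand-in transport: pretends the router pins when the header is set
    and alternates models per step when it is not. Purely illustrative."""
    if extra_headers.get("X-Session-Id"):
        model = "claude-sonnet"                                  # pinned
    else:
        model = ["claude-sonnet", "gpt-4o"][len(messages) // 2 % 2]  # alternating
    return {"model": model, "content": "…"}

def run_loop(create_fn, prompts, session_id=None):
    """Run the agent loop once and record which model served each step."""
    headers = {"X-Session-Id": session_id} if session_id else {}
    messages, models = [], []
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        resp = create_fn(messages=list(messages), extra_headers=headers)
        models.append(resp["model"])
        messages.append({"role": "assistant", "content": resp["content"]})
    return models

prompts = ["requirements", "compare", "schema", "scalability", "report"]
unpinned = run_loop(fake_create, prompts)                              # models alternate
pinned = run_loop(fake_create, prompts, session_id=str(uuid.uuid4()))  # one model throughout
```

In the real demo, `create_fn` is the OpenAI SDK call shown under "How It Works" below; the only difference between the two runs is whether the header is sent.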
### Expected Output
```
Run 1: WITHOUT Session Pinning
─────────────────────────────────────────────────────────────────────
step 1 [claude-sonnet-4-20250514] List requirements
"Critical requirements: 1. ACID transactions for order integrity…"
step 2 [gpt-4o ] Compare databases ← switched
"PostgreSQL excels at joins and ACID guarantees…"
step 3 [claude-sonnet-4-20250514] Write schema ← switched
"CREATE TABLE orders (\n id SERIAL PRIMARY KEY…"
step 4 [gpt-4o ] Assess scalability ← switched
"At high write volume, PostgreSQL row-level locking…"
step 5 [claude-sonnet-4-20250514] Write report ← switched
"RECOMMENDATION: PostgreSQL is the right choice…"
✗ Without pinning: model switched 4 time(s) — gpt-4o, claude-sonnet-4-20250514
Run 2: WITH Session Pinning (X-Session-Id: a1b2c3d4…)
─────────────────────────────────────────────────────────────────────
step 1 [claude-sonnet-4-20250514] List requirements
"Critical requirements: 1. ACID transactions for order integrity…"
step 2 [claude-sonnet-4-20250514] Compare databases
"Building on the requirements I just outlined: PostgreSQL…"
step 3 [claude-sonnet-4-20250514] Write schema
"Following the comparison above, here is the PostgreSQL schema…"
step 4 [claude-sonnet-4-20250514] Assess scalability
"Given the schema I designed, PostgreSQL's row-level locking…"
step 5 [claude-sonnet-4-20250514] Write report
"RECOMMENDATION: Based on my analysis of requirements, comparison…"
✓ With pinning: claude-sonnet-4-20250514 held for all 5 steps
══ Final Report (pinned session) ═════════════════════════════════════
RECOMMENDATION: Based on my analysis of requirements, the head-to-head
comparison, the schema I designed, and the scalability trade-offs…
══════════════════════════════════════════════════════════════════════
```
### How It Works
Session pinning is implemented in brightstaff. When `X-Session-Id` is present:
1. **First request** — routing runs normally, result is cached keyed by session ID
2. **Subsequent requests** — cache hit skips routing and returns the cached model instantly

The `X-Session-Id` header is forwarded transparently; no changes to your OpenAI
SDK calls beyond adding the header.
```python
import uuid

from openai import OpenAI

client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

session_id = str(uuid.uuid4())
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Session-Id": session_id},  # pin the session
)
```
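On the gateway side, the cached decision can be pictured as a TTL-bounded, size-capped map. The following is a simplified sketch of that idea, not brightstaff's actual implementation; the class and method names are invented:

```python
import time
from collections import OrderedDict

class SessionCache:
    """Pin session_id → model for ttl_seconds, evicting the least-recently-used
    session once max_entries is exceeded. Illustrative only."""

    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # session_id -> (model, expires_at)

    def resolve(self, session_id, route_fn, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(session_id)
        if entry and entry[1] > now:
            self._entries.move_to_end(session_id)  # refresh LRU position
            return entry[0]                        # cache hit: skip routing
        model = route_fn()                         # first request (or expired): route normally
        self._entries[session_id] = (model, now + self.ttl)
        self._entries.move_to_end(session_id)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)      # evict least-recently-used session
        return model
```

The two failure-free exits match the numbered list above: a live entry returns the pinned model without calling the router, while a miss or expiry runs routing once and re-pins.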
---
## Configuration
Session pinning is configurable in `config.yaml`:
```yaml
routing:
  session_ttl_seconds: 600     # How long a pinned session lasts (default: 10 min)
  session_max_entries: 10000   # Max cached sessions before LRU eviction
```
Without the `X-Session-Id` header, routing runs fresh every time — no breaking
change to existing clients.
---
## See Also
- [Model Routing Service Demo](../model_routing_service/) — curl-based examples of the routing endpoint