# Session Pinning Demo
> Consistent model selection for agentic loops using `X-Session-Id`.
## Why Session Pinning?
When an agent runs in a loop — research → analyse → implement → evaluate → summarise — each step hits Plano's router independently. Because prompts vary in intent, the router may select **different models** for each step, fragmenting context mid-session.
**Session pinning** solves this: send an `X-Session-Id` header and the first request runs routing as usual, caching the decision. Every subsequent request with the same session ID returns the **same model**, without re-running the router.
```
Without pinning                       With pinning (X-Session-Id)
─────────────────────────────────     ──────────────────────────────
Step 1 → claude-sonnet (code_gen)     Step 1 → claude-sonnet ← routed
Step 2 → gpt-4o (reasoning)           Step 2 → claude-sonnet ← pinned ✓
Step 3 → claude-sonnet (code_gen)     Step 3 → claude-sonnet ← pinned ✓
Step 4 → gpt-4o (reasoning)           Step 4 → claude-sonnet ← pinned ✓
Step 5 → claude-sonnet (code_gen)     Step 5 → claude-sonnet ← pinned ✓

  ↑ model switches every step           ↑ one model, start to finish
```
---
## Quick Start
```bash
# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/session_pinning
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh   # or: uv run demo.py
```
---
## What the Demo Does
A **Database Research Agent** investigates whether to use PostgreSQL or MongoDB
for an e-commerce platform. It runs 5 steps, each building on prior findings via
accumulated message history. Steps alternate between `code_generation` and
`complex_reasoning` intents so Plano routes to different models without pinning.
| Step | Task | Intent → Model |
|:----:|------|--------|
| 1 | List technical requirements | code_generation → claude-sonnet |
| 2 | Compare PostgreSQL vs MongoDB | complex_reasoning → gpt-4o |
| 3 | Write schema (CREATE TABLE) | code_generation → claude-sonnet |
| 4 | Assess scalability trade-offs | complex_reasoning → gpt-4o |
| 5 | Write final recommendation report | code_generation → claude-sonnet |
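The accumulated-history loop behind the table above can be sketched as follows (prompt wording and the `build_history` helper are illustrative, not the demo's actual code):

```python
# Illustrative step prompts; the demo's actual wording differs.
STEPS = [
    "List the technical requirements for the e-commerce platform.",  # code_generation
    "Compare PostgreSQL vs MongoDB against those requirements.",     # complex_reasoning
    "Write the schema (CREATE TABLE statements).",                   # code_generation
    "Assess the scalability trade-offs of that schema.",             # complex_reasoning
    "Write the final recommendation report.",                        # code_generation
]

def build_history(answers):
    """Fold each step's prompt and its answer into one growing message list,
    so step N sees everything produced in steps 1..N-1."""
    messages = []
    for prompt, answer in zip(STEPS, answers):
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": answer})
    return messages
```

Because every step carries the full history forward, a mid-loop model switch forces the new model to interpret another model's phrasing and conventions cold.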
The demo runs the loop **twice** against `/v1/chat/completions` using the
[OpenAI SDK](https://github.com/openai/openai-python):

1. **Without pinning** — no `X-Session-Id`; models alternate per step
2. **With pinning** — `X-Session-Id` header included; model is pinned from step 1

Each step makes real LLM calls. Step 5's report explicitly references findings
from earlier steps, demonstrating why coherent context requires a consistent model.
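The shape of the two runs can be sketched with a stubbed transport standing in for `client.chat.completions.create` (the stub and its alternation rule are invented for illustration; only the header handling mirrors the demo):

```python
import uuid

def fake_create(messages, extra_headers):
    """Stand-in transport: pretends the router pins when the header is set
    and alternates models per step when it is not. Purely illustrative."""
    if extra_headers.get("X-Session-Id"):
        model = "claude-sonnet"                                  # pinned
    else:
        model = ["claude-sonnet", "gpt-4o"][len(messages) // 2 % 2]  # alternating
    return {"model": model, "content": "…"}

def run_loop(create_fn, prompts, session_id=None):
    """Run the agent loop once and record which model served each step."""
    headers = {"X-Session-Id": session_id} if session_id else {}
    messages, models = [], []
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        resp = create_fn(messages=list(messages), extra_headers=headers)
        models.append(resp["model"])
        messages.append({"role": "assistant", "content": resp["content"]})
    return models

prompts = ["requirements", "compare", "schema", "scalability", "report"]
unpinned = run_loop(fake_create, prompts)                              # models alternate
pinned = run_loop(fake_create, prompts, session_id=str(uuid.uuid4()))  # one model throughout
```

In the real demo, `create_fn` is the OpenAI SDK call shown under "How It Works" below; the only difference between the two runs is whether the header is sent.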
### Expected Output
```
Run 1: WITHOUT Session Pinning
─────────────────────────────────────────────────────────────────────
step 1 [claude-sonnet-4-20250514] List requirements
"Critical requirements: 1. ACID transactions for order integrity…"
step 2 [gpt-4o ] Compare databases ← switched
"PostgreSQL excels at joins and ACID guarantees…"
step 3 [claude-sonnet-4-20250514] Write schema ← switched
"CREATE TABLE orders (\n id SERIAL PRIMARY KEY…"
step 4 [gpt-4o ] Assess scalability ← switched
"At high write volume, PostgreSQL row-level locking…"
step 5 [claude-sonnet-4-20250514] Write report ← switched
"RECOMMENDATION: PostgreSQL is the right choice…"
✗ Without pinning: model switched 4 time(s) — gpt-4o, claude-sonnet-4-20250514
Run 2: WITH Session Pinning (X-Session-Id: a1b2c3d4…)
─────────────────────────────────────────────────────────────────────
step 1 [claude-sonnet-4-20250514] List requirements
"Critical requirements: 1. ACID transactions for order integrity…"
step 2 [claude-sonnet-4-20250514] Compare databases
"Building on the requirements I just outlined: PostgreSQL…"
step 3 [claude-sonnet-4-20250514] Write schema
"Following the comparison above, here is the PostgreSQL schema…"
step 4 [claude-sonnet-4-20250514] Assess scalability
"Given the schema I designed, PostgreSQL's row-level locking…"
step 5 [claude-sonnet-4-20250514] Write report
"RECOMMENDATION: Based on my analysis of requirements, comparison…"
✓ With pinning: claude-sonnet-4-20250514 held for all 5 steps
══ Final Report (pinned session) ═════════════════════════════════════
RECOMMENDATION: Based on my analysis of requirements, the head-to-head
comparison, the schema I designed, and the scalability trade-offs…
══════════════════════════════════════════════════════════════════════
```
### How It Works
Session pinning is implemented in brightstaff. When `X-Session-Id` is present:
1. **First request** — routing runs normally, result is cached keyed by session ID
2. **Subsequent requests** — cache hit skips routing and returns the cached model instantly

The `X-Session-Id` header is forwarded transparently; no changes to your OpenAI
SDK calls beyond adding the header.
```python
import uuid

from openai import OpenAI

client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

session_id = str(uuid.uuid4())
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Session-Id": session_id},  # pin the session
)
```
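On the gateway side, the cached decision can be pictured as a TTL-bounded, size-capped map. The following is a simplified sketch of that idea, not brightstaff's actual implementation; the class and method names are invented:

```python
import time
from collections import OrderedDict

class SessionCache:
    """Pin session_id → model for ttl_seconds, evicting the least-recently-used
    session once max_entries is exceeded. Illustrative only."""

    def __init__(self, ttl_seconds=600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # session_id -> (model, expires_at)

    def resolve(self, session_id, route_fn, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(session_id)
        if entry and entry[1] > now:
            self._entries.move_to_end(session_id)  # refresh LRU position
            return entry[0]                        # cache hit: skip routing
        model = route_fn()                         # first request (or expired): route normally
        self._entries[session_id] = (model, now + self.ttl)
        self._entries.move_to_end(session_id)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)      # evict least-recently-used session
        return model
```

The two failure-free exits match the numbered list above: a live entry returns the pinned model without calling the router, while a miss or expiry runs routing once and re-pins.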
---
## Configuration
Session pinning is configurable in `config.yaml`:
```yaml
routing:
  session_ttl_seconds: 600     # How long a pinned session lasts (default: 10 min)
  session_max_entries: 10000   # Max cached sessions before LRU eviction
```
Without the `X-Session-Id` header, routing runs fresh every time — no breaking
change to existing clients.
---
## See Also
- [Model Routing Service Demo](../model_routing_service/) — curl-based examples of the routing endpoint