
Session Pinning Demo

Consistent model selection for agentic loops using X-Session-Id.

Why Session Pinning?

When an agent runs in a loop — research → analyse → implement → evaluate → summarise — each step hits Plano's router independently. Because prompts vary in intent, the router may select different models for each step, fragmenting context mid-session.

Session pinning solves this: send an X-Session-Id header and the first request runs routing as usual, caching the decision. Every subsequent request with the same session ID returns the same model, without re-running the router.

Without pinning                          With pinning (X-Session-Id)
─────────────────                        ──────────────────────────
Step 1 → claude-sonnet  (code_gen)       Step 1 → claude-sonnet  ← routed
Step 2 → gpt-4o         (reasoning)      Step 2 → claude-sonnet  ← pinned ✓
Step 3 → claude-sonnet  (code_gen)       Step 3 → claude-sonnet  ← pinned ✓
Step 4 → gpt-4o         (reasoning)      Step 4 → claude-sonnet  ← pinned ✓
Step 5 → claude-sonnet  (code_gen)       Step 5 → claude-sonnet  ← pinned ✓
       ↑ model switches every step                ↑ one model, start to finish

Quick Start

# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/session_pinning
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh          # or: uv run demo.py

What the Demo Does

A Database Research Agent investigates whether to use PostgreSQL or MongoDB for an e-commerce platform. It runs 5 steps, each building on prior findings via accumulated message history. Steps alternate between code_generation and complex_reasoning intents so Plano routes to different models without pinning.

Step  Task                                Intent
────  ─────────────────────────────────   ─────────────────────────────────
1     List technical requirements         code_generation   → claude-sonnet
2     Compare PostgreSQL vs MongoDB       complex_reasoning → gpt-4o
3     Write schema (CREATE TABLE)         code_generation   → claude-sonnet
4     Assess scalability trade-offs       complex_reasoning → gpt-4o
5     Write final recommendation report   code_generation   → claude-sonnet

The demo runs the loop twice against /v1/chat/completions using the OpenAI SDK:

  1. Without pinning — no X-Session-Id; models alternate per step
  2. With pinning — X-Session-Id header included; the model is pinned from step 1

Each step makes real LLM calls. Step 5's report explicitly references findings from earlier steps, demonstrating why coherent context requires a consistent model.
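The loop's shape can be sketched as follows (a hedged sketch, not the demo's exact code — the prompts are abbreviated and `ask` is an injected stand-in for the real LLM call). The key point is that the message history accumulates, so every step sees all prior findings:

```python
STEPS = [
    "List technical requirements",         # code_generation
    "Compare PostgreSQL vs MongoDB",       # complex_reasoning
    "Write schema (CREATE TABLE)",         # code_generation
    "Assess scalability trade-offs",       # complex_reasoning
    "Write final recommendation report",   # code_generation
]

def run_session(ask, session_id=None):
    """Run the 5-step loop. `ask(messages, headers) -> (model, reply)` is
    injected so the loop can be exercised without a live gateway."""
    headers = {"X-Session-Id": session_id} if session_id else {}
    messages, trace = [], []
    for task in STEPS:
        messages.append({"role": "user", "content": task})
        model, reply = ask(messages, headers)
        messages.append({"role": "assistant", "content": reply})  # keep context
        trace.append(model)
    return trace  # which model served each step
```

Because each `ask` call carries the full history, a mid-session model switch means the new model must pick up another model's partial analysis — which is exactly what pinning avoids.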

Expected Output

  Run 1: WITHOUT Session Pinning
  ─────────────────────────────────────────────────────────────────────
  step 1  [claude-sonnet-4-20250514]  List requirements
          "Critical requirements: 1. ACID transactions for order integrity…"

  step 2  [gpt-4o                 ]  Compare databases    ← switched
          "PostgreSQL excels at joins and ACID guarantees…"

  step 3  [claude-sonnet-4-20250514]  Write schema        ← switched
          "CREATE TABLE orders (\n  id SERIAL PRIMARY KEY…"

  step 4  [gpt-4o                 ]  Assess scalability   ← switched
          "At high write volume, PostgreSQL row-level locking…"

  step 5  [claude-sonnet-4-20250514]  Write report        ← switched
          "RECOMMENDATION: PostgreSQL is the right choice…"

  ✗  Without pinning: model switched 4 time(s) — gpt-4o, claude-sonnet-4-20250514


  Run 2: WITH Session Pinning  (X-Session-Id: a1b2c3d4…)
  ─────────────────────────────────────────────────────────────────────
  step 1  [claude-sonnet-4-20250514]  List requirements
          "Critical requirements: 1. ACID transactions for order integrity…"

  step 2  [claude-sonnet-4-20250514]  Compare databases
          "Building on the requirements I just outlined: PostgreSQL…"

  step 3  [claude-sonnet-4-20250514]  Write schema
          "Following the comparison above, here is the PostgreSQL schema…"

  step 4  [claude-sonnet-4-20250514]  Assess scalability
          "Given the schema I designed, PostgreSQL's row-level locking…"

  step 5  [claude-sonnet-4-20250514]  Write report
          "RECOMMENDATION: Based on my analysis of requirements, comparison…"

  ✓  With pinning: claude-sonnet-4-20250514 held for all 5 steps

  ══ Final Report (pinned session) ═════════════════════════════════════
  RECOMMENDATION: Based on my analysis of requirements, the head-to-head
  comparison, the schema I designed, and the scalability trade-offs…
  ══════════════════════════════════════════════════════════════════════

How It Works

Session pinning is implemented in brightstaff. When X-Session-Id is present:

  1. First request — routing runs normally, result is cached keyed by session ID
  2. Subsequent requests — cache hit skips routing and returns the cached model instantly
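The two cases above amount to a small decision function. This is an illustrative sketch only — `run_router` and the plain-dict cache are hypothetical stand-ins, not brightstaff's actual internals:

```python
def select_model(session_id, prompt, session_cache, run_router):
    """Return (model, was_pinned) for one request."""
    if session_id is not None:
        cached = session_cache.get(session_id)
        if cached is not None:
            return cached, True              # cache hit: router skipped entirely
    model = run_router(prompt)               # first request: route normally
    if session_id is not None:
        session_cache[session_id] = model    # pin the decision for this session
    return model, False
```

Note that a request without a session ID never reads or writes the cache, which is why clients that omit the header see no behavior change.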

The X-Session-Id header is forwarded transparently; no changes to your OpenAI SDK calls beyond adding the header.

import uuid

from openai import OpenAI

client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

session_id = str(uuid.uuid4())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Session-Id": session_id},  # pin the session
)

Configuration

Session pinning is configurable in config.yaml:

routing:
  session_ttl_seconds: 600      # How long a pinned session lasts (default: 10 min)
  session_max_entries: 10000    # Max cached sessions before LRU eviction

Without the X-Session-Id header, routing runs fresh every time — no breaking change to existing clients.
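What those two settings govern can be sketched with a plain OrderedDict — TTL expiry (a pin older than session_ttl_seconds routes fresh again) and LRU eviction once session_max_entries is reached. This is illustrative only, not brightstaff's actual cache:

```python
import time
from collections import OrderedDict

class SessionPinCache:
    """TTL + LRU cache of routing decisions, keyed by session ID (sketch)."""
    def __init__(self, ttl_seconds=600, max_entries=10_000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # session_id -> (model, expires_at)

    def get(self, session_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        model, expires_at = entry
        if now >= expires_at:
            del self._entries[session_id]      # TTL expired: route fresh next time
            return None
        self._entries.move_to_end(session_id)  # mark as recently used
        return model

    def put(self, session_id, model, now=None):
        now = time.monotonic() if now is None else now
        if session_id not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict least-recently-used session
        self._entries[session_id] = (model, now + self.ttl)
```

The `now` parameter is only there to make the sketch deterministic to exercise; a real cache would read the clock itself.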


See Also