add Redis session affinity demos (Docker Compose and Kubernetes)

2026-05-18 13:45:15 +02:00 · 2026-04-09 16:32:40 -07:00 · 2026-04-09 16:32:40 -07:00 · 90810078da
commit 90810078da
parent 50670f843d
20 changed files with 2080 additions and 0 deletions
--- a/demos/llm_routing/session_affinity_redis/.env.example
+++ b/demos/llm_routing/session_affinity_redis/.env.example
@@ -0,0 +1 @@
+OPENAI_API_KEY=sk-replace-me
--- a/demos/llm_routing/session_affinity_redis/README.md
+++ b/demos/llm_routing/session_affinity_redis/README.md
@@ -0,0 +1,247 @@
+# Session Affinity with Redis — Multi-Replica Model Pinning
+
+This demo shows Plano's **session affinity** (`X-Model-Affinity` header) backed by a **Redis session cache** instead of the default in-memory store.
+
+## The Problem
+
+By default, model affinity stores routing decisions in a per-process `HashMap`.
+This works for single-instance deployments, but breaks when you run multiple
+Plano replicas behind a load balancer:
+
+```
+Client ──► Load Balancer ──► Replica A  (session pinned here)
+                         └──► Replica B  (knows nothing about the session)
+```
+
+A session that was pinned to `openai/gpt-5.2` on Replica A will be re-routed
+from scratch on Replica B, defeating the purpose of affinity.
+
+## The Solution
+
+Plano's `session_cache` config key accepts a `type: redis` backend that is
+shared across all replicas:
+
+```yaml
+routing:
+  session_ttl_seconds: 300
+  session_cache:
+    type: redis
+    url: redis://localhost:6379
+```
+
+All replicas read and write the same Redis keyspace. A session pinned on any
+replica is immediately visible to all others.
+
+## What to Look For
+
+| What | Expected behaviour |
+|------|--------------------|
+| First request with a session ID | Plano routes normally (via Arch-Router) and writes the result to Redis (`SET session-id ... EX 300`) |
+| Subsequent requests with the **same** session ID | Plano reads from Redis and skips the router — same model every time |
+| Requests with a **different** session ID | Routed independently; may land on a different model |
+| After `session_ttl_seconds` elapses | Redis key expires; next request re-routes and sets a new pin |
+| `x-plano-pinned: true` response header | Tells you the response was served from the session cache |
+
+## Architecture
+
+```
+Client
+  │  X-Model-Affinity: my-session-id
+  ▼
+Plano (brightstaff)
+  ├── GET redis://localhost:6379/my-session-id
+  │     hit?  → return pinned model immediately (no Arch-Router call)
+  │     miss? → call Arch-Router → SET key EX 300 → return routed model
+  ▼
+Redis  (shared across replicas)
+```
+
+## Prerequisites
+
+| Requirement | Notes |
+|-------------|-------|
+| `planoai` CLI | `pip install planoai` |
+| Docker + Docker Compose | For Redis and Jaeger |
+| `OPENAI_API_KEY` | Required for routing model (Arch-Router) and downstream LLMs |
+| Python 3.11+ | Only needed to run `verify_affinity.py` |
+
+## Quick Start
+
+```bash
+# 1. Set your API key
+export OPENAI_API_KEY=sk-...
+# or copy and edit:
+cp .env.example .env
+
+# 2. Start Redis, Jaeger, and Plano
+./run_demo.sh up
+
+# 3. Verify session pinning works
+python verify_affinity.py
+```
+
+## Manual Verification with curl
+
+### Step 1 — Pin a session (first request sets the affinity)
+
+```bash
+curl -s http://localhost:12000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-model-affinity: my-session-abc" \
+  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Write a short poem about the ocean."}]}' \
+  | jq '{model, pinned: .x_plano_pinned}'
+```
+
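+The `jq` filter reads `pinned` from the JSON body; to check the `x-plano-pinned` response header itself, rerun the request with `curl -i` and without the `jq` pipe. Expected output of the command above (first request: not yet pinned, so Arch-Router picks the model):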
+
+```json
+{
+  "model": "openai/gpt-5.2",
+  "pinned": null
+}
+```
+
+### Step 2 — Confirm the pin is held on subsequent requests
+
+```bash
+for i in 1 2 3 4; do
+  curl -s http://localhost:12000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -H "x-model-affinity: my-session-abc" \
+    -d "{\"model\":\"openai/gpt-4o-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Request $i\"}]}" \
+    | jq -r '"\(.model)"'
+done
+```
+
+Expected output (same model for every request):
+
+```
+openai/gpt-5.2
+openai/gpt-5.2
+openai/gpt-5.2
+openai/gpt-5.2
+```
+
+### Step 3 — Inspect the Redis key directly
+
+```bash
+docker exec plano-session-redis redis-cli \
+  GET my-session-abc | python3 -m json.tool
+```
+
+Expected output:
+
+```json
+{
+  "model_name": "openai/gpt-5.2",
+  "route_name": "deep_reasoning"
+}
+```
+
+```bash
+# Check the TTL (seconds remaining)
+docker exec plano-session-redis redis-cli TTL my-session-abc
+# e.g. 287
+```
+
+### Step 4 — Different sessions may get different models
+
+```bash
+for session in session-A session-B session-C; do
+  model=$(curl -s http://localhost:12000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -H "x-model-affinity: $session" \
+    -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Explain quantum entanglement in detail with equations."}]}' \
+    | jq -r '.model')
+  echo "$session -> $model"
+done
+```
+
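+Each session is routed independently on its first request: content matched to `deep_reasoning` pins to `openai/gpt-5.2`,
+and content matched to `fast_responses` pins to `openai/gpt-4o-mini`. The loop above sends the same prompt for every session, so vary the prompt to see different pins.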
+
+## Verification Script Output
+
+Running `python verify_affinity.py` produces output like:
+
+```
+Plano endpoint : http://localhost:12000/v1/chat/completions
+Sessions       : 3
+Rounds/session : 4
+
+============================================================
+Phase 1: Requests WITHOUT X-Model-Affinity header
+  (model may vary between requests — that is expected)
+============================================================
+  Request 1: model = openai/gpt-4o-mini
+  Request 2: model = openai/gpt-5.2
+  Request 3: model = openai/gpt-4o-mini
+  Models seen across 3 requests: {'openai/gpt-4o-mini', 'openai/gpt-5.2'}
+
+============================================================
+Phase 2: Requests WITH X-Model-Affinity (session pinning)
+  Each session should be pinned to exactly one model.
+============================================================
+
+  Session 'demo-session-001':
+    Round 1: model = openai/gpt-4o-mini  [FIRST — sets affinity]
+    Round 2: model = openai/gpt-4o-mini  [PINNED]
+    Round 3: model = openai/gpt-4o-mini  [PINNED]
+    Round 4: model = openai/gpt-4o-mini  [PINNED]
+
+  Session 'demo-session-002':
+    Round 1: model = openai/gpt-5.2       [FIRST — sets affinity]
+    Round 2: model = openai/gpt-5.2       [PINNED]
+    Round 3: model = openai/gpt-5.2       [PINNED]
+    Round 4: model = openai/gpt-5.2       [PINNED]
+
+  Session 'demo-session-003':
+    Round 1: model = openai/gpt-4o-mini  [FIRST — sets affinity]
+    Round 2: model = openai/gpt-4o-mini  [PINNED]
+    Round 3: model = openai/gpt-4o-mini  [PINNED]
+    Round 4: model = openai/gpt-4o-mini  [PINNED]
+
+============================================================
+Results
+============================================================
+  PASS  demo-session-001 -> always routed to 'openai/gpt-4o-mini'
+  PASS  demo-session-002 -> always routed to 'openai/gpt-5.2'
+  PASS  demo-session-003 -> always routed to 'openai/gpt-4o-mini'
+
+All sessions were pinned consistently.
+Redis session cache is working correctly.
+```
+
+## Observability
+
+Open Jaeger at **http://localhost:16686** and select service `plano`.
+
+- Requests **without** affinity: look for a span to the Arch-Router service
+- Requests **with** affinity (pinned): the Arch-Router span will be absent —
+  the decision was served from Redis without calling the router at all
+
+This is the clearest observable signal that the cache is working: pinned
+requests skip the router call, so their traces contain fewer spans and they typically complete faster.
+
+## Switching to the In-Memory Backend
+
+To compare against the default in-memory backend, change `config.yaml`:
+
+```yaml
+routing:
+  session_ttl_seconds: 300
+  session_cache:
+    type: memory     # ← change this
+```
+
+In-memory mode does **not** require Redis and works identically for a
+single Plano process. The difference only becomes visible when you run
+multiple replicas.
+
+## Teardown
+
+```bash
+./run_demo.sh down
+```
+
+This stops Plano, Redis, and Jaeger.
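
The hit/miss flow from the README's Architecture section, and the multi-replica problem it solves, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SessionStore` is an in-memory stand-in for the Redis `GET` / `SET ... EX` calls, and `Replica` and `toy_router` are hypothetical names, not Plano's implementation or Arch-Router's logic.

```python
import time
from typing import Callable, Optional


class SessionStore:
    """In-memory stand-in for Redis `SET key value EX ttl` and `GET key` (assumption)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # An expired entry behaves like a miss, as an expired Redis key would.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


class Replica:
    """Hypothetical replica: consult the store first, call the router only on a miss."""

    def __init__(
        self,
        store: SessionStore,
        router: Callable[[str], str],
        ttl_seconds: float = 300,
    ) -> None:
        self.store = store
        self.router = router
        self.ttl_seconds = ttl_seconds
        self.router_calls = 0

    def route(self, session_id: str, prompt: str) -> tuple[str, bool]:
        """Return (model, pinned); pinned is True when the store answered."""
        cached = self.store.get(session_id)
        if cached is not None:
            return cached, True
        self.router_calls += 1
        model = self.router(prompt)
        self.store.set(session_id, model, self.ttl_seconds)
        return model, False


def toy_router(prompt: str) -> str:
    # Placeholder for the real router: long prompts go to the larger model.
    return "openai/gpt-5.2" if len(prompt) > 40 else "openai/gpt-4o-mini"


if __name__ == "__main__":
    shared = SessionStore()
    replica_a = Replica(shared, toy_router)
    replica_b = Replica(shared, toy_router)
    print(replica_a.route("s1", "Explain quantum entanglement in detail with equations."))
    # Replica B sees the pin written by Replica A because the store is shared.
    print(replica_b.route("s1", "Hi"))
```

Giving each `Replica` its own `SessionStore` reproduces the default in-memory behaviour: the second replica misses and routes the session from scratch.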
--- a/demos/llm_routing/session_affinity_redis/config.yaml
+++ b/demos/llm_routing/session_affinity_redis/config.yaml
@@ -0,0 +1,36 @@
+version: v0.4.0
+
+listeners:
+  - type: model
+    name: model_listener
+    port: 12000
+
+model_providers:
+  - model: openai/gpt-4o-mini
+    access_key: $OPENAI_API_KEY
+    default: true
+
+  - model: openai/gpt-5.2
+    access_key: $OPENAI_API_KEY
+
+routing_preferences:
+  - name: fast_responses
+    description: short factual questions, quick lookups, simple summarization, or greetings
+    models:
+      - openai/gpt-4o-mini
+
+  - name: deep_reasoning
+    description: multi-step reasoning, complex analysis, code review, or detailed explanations
+    models:
+      - openai/gpt-5.2
+      - openai/gpt-4o-mini
+
+routing:
+  session_ttl_seconds: 300
+  session_cache:
+    type: redis
+    url: redis://localhost:6379
+
+tracing:
+  random_sampling: 100
+  trace_arch_internal: true
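
One property of this config worth checking by hand is that every model named under `routing_preferences` is also declared under `model_providers`. The sketch below shows that check on a dict mirroring the YAML above. `undeclared_models` and `DEMO_CONFIG` are hypothetical names for illustration; whether Plano performs an equivalent validation at startup is not shown in this change.

```python
# Dict mirroring the relevant parts of config.yaml (keys copied from the file).
DEMO_CONFIG = {
    "model_providers": [
        {"model": "openai/gpt-4o-mini", "default": True},
        {"model": "openai/gpt-5.2"},
    ],
    "routing_preferences": [
        {"name": "fast_responses", "models": ["openai/gpt-4o-mini"]},
        {
            "name": "deep_reasoning",
            "models": ["openai/gpt-5.2", "openai/gpt-4o-mini"],
        },
    ],
}


def undeclared_models(config: dict) -> dict[str, list[str]]:
    """Map each routing preference to the models it names that no provider declares."""
    declared = {provider["model"] for provider in config.get("model_providers", [])}
    problems: dict[str, list[str]] = {}
    for preference in config.get("routing_preferences", []):
        missing = [m for m in preference.get("models", []) if m not in declared]
        if missing:
            problems[preference["name"]] = missing
    return problems


if __name__ == "__main__":
    # An empty dict means every referenced model has a provider entry.
    print(undeclared_models(DEMO_CONFIG))
```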
--- a/demos/llm_routing/session_affinity_redis/docker-compose.yaml
+++ b/demos/llm_routing/session_affinity_redis/docker-compose.yaml
@@ -0,0 +1,23 @@
+services:
+  redis:
+    image: redis:7-alpine
+    container_name: plano-session-redis
+    restart: unless-stopped
+    ports:
+      - "6379:6379"
+    command: redis-server --save "" --appendonly no
+    healthcheck:
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 1s
+      timeout: 1s
+      retries: 10
+
+  jaeger:
+    build:
+      context: ../../shared/jaeger
+    container_name: plano-session-jaeger
+    restart: unless-stopped
+    ports:
+      - "16686:16686"
+      - "4317:4317"
+      - "4318:4318"
--- a/demos/llm_routing/session_affinity_redis/run_demo.sh
+++ b/demos/llm_routing/session_affinity_redis/run_demo.sh
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+DEMO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+load_env() {
+  if [ -f "$DEMO_DIR/.env" ]; then
+    set -a
+    # shellcheck disable=SC1091
+    source "$DEMO_DIR/.env"
+    set +a
+  fi
+}
+
+check_prereqs() {
+  local missing=()
+  command -v docker   >/dev/null 2>&1 || missing+=("docker")
+  command -v planoai  >/dev/null 2>&1 || missing+=("planoai (pip install planoai)")
+  if [ ${#missing[@]} -gt 0 ]; then
+    echo "ERROR: missing required tools: ${missing[*]}"
+    exit 1
+  fi
+
+  if [ -z "${OPENAI_API_KEY:-}" ]; then
+    echo "ERROR: OPENAI_API_KEY is not set."
+    echo "       Create a .env file or export the variable before running."
+    exit 1
+  fi
+}
+
+start_demo() {
+  echo "==> Starting Redis + Jaeger..."
+  docker compose -f "$DEMO_DIR/docker-compose.yaml" up -d
+
+  echo "==> Waiting for Redis to be ready..."
+  local retries=0
+  until docker exec plano-session-redis redis-cli ping 2>/dev/null | grep -q PONG; do
+    retries=$((retries + 1))
+    if [ "$retries" -ge 15 ]; then
+      echo "ERROR: Redis did not become ready in time"
+      exit 1
+    fi
+    sleep 1
+  done
+  echo "    Redis is ready."
+
+  echo "==> Starting Plano..."
+  planoai up "$DEMO_DIR/config.yaml"
+
+  echo ""
+  echo "Demo is running!"
+  echo ""
+  echo "  Model endpoint:  http://localhost:12000/v1/chat/completions"
+  echo "  Jaeger UI:       http://localhost:16686"
+  echo "  Redis:           localhost:6379"
+  echo ""
+  echo "Run the verification script to confirm session pinning:"
+  echo "  python $DEMO_DIR/verify_affinity.py"
+  echo ""
+  echo "Stop the demo with: $0 down"
+}
+
+stop_demo() {
+  echo "==> Stopping Plano..."
+  planoai down 2>/dev/null || true
+
+  echo "==> Stopping Docker services..."
+  docker compose -f "$DEMO_DIR/docker-compose.yaml" down
+
+  echo "Demo stopped."
+}
+
+usage() {
+  echo "Usage: $0 [up|down]"
+  echo ""
+  echo "  up    Start Redis, Jaeger, and Plano (default)"
+  echo "  down  Stop all services"
+}
+
+load_env
+
+case "${1:-up}" in
+  up)
+    check_prereqs
+    start_demo
+    ;;
+  down)
+    stop_demo
+    ;;
+  *)
+    usage
+    exit 1
+    ;;
+esac
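
The readiness loop in `start_demo` probes Redis up to 15 times with a one-second pause between attempts. The same bounded-retry pattern is sketched below in Python, the language of the verification script. `wait_until_ready` and `redis_ping_probe` are hypothetical helpers, not part of the demo; the probe shells out to the same `docker exec ... redis-cli ping` command the script uses.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    max_attempts: int = 15,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to max_attempts times, sleeping between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        if probe():
            return True
        if attempt < max_attempts:
            # No sleep after the final failure, matching the shell loop's exit.
            sleep(delay_seconds)
    return False


def redis_ping_probe(container: str = "plano-session-redis") -> bool:
    """Return True when `redis-cli ping` inside the container answers PONG."""
    try:
        result = subprocess.run(
            ["docker", "exec", container, "redis-cli", "ping"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # docker is not installed or not on PATH.
        return False
    return "PONG" in result.stdout


if __name__ == "__main__":
    if wait_until_ready(redis_ping_probe):
        print("Redis is ready.")
    else:
        raise SystemExit("ERROR: Redis did not become ready in time")
```

Passing `sleep` as a parameter keeps the helper testable without real delays.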
--- a/demos/llm_routing/session_affinity_redis/verify_affinity.py
+++ b/demos/llm_routing/session_affinity_redis/verify_affinity.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+"""
+verify_affinity.py — Verify that model affinity (session pinning) works correctly.
+
+Sends multiple requests with the same X-Model-Affinity session ID and asserts
+that every response is served by the same model, demonstrating that Plano's
+session cache is working as expected.
+
+Usage:
+    python verify_affinity.py [--url URL] [--rounds N] [--sessions N]
+"""
+
+import argparse
+import json
+import sys
+import urllib.error
+import urllib.request
+from collections import defaultdict
+
+PLANO_URL = "http://localhost:12000/v1/chat/completions"
+
+PROMPTS = [
+    "What is 2 + 2?",
+    "Name the capital of France.",
+    "How many days in a week?",
+    "What color is the sky?",
+    "Who wrote Romeo and Juliet?",
+]
+
+MESSAGES_PER_SESSION = [{"role": "user", "content": prompt} for prompt in PROMPTS]
+
+
+def chat(url: str, session_id: str | None, message: str) -> dict:
+    payload = json.dumps(
+        {
+            "model": "openai/gpt-4o-mini",
+            "messages": [{"role": "user", "content": message}],
+        }
+    ).encode()
+
+    headers = {"Content-Type": "application/json"}
+    if session_id:
+        headers["x-model-affinity"] = session_id
+
+    req = urllib.request.Request(url, data=payload, headers=headers, method="POST")
+    try:
+        with urllib.request.urlopen(req, timeout=30) as resp:
+            return json.loads(resp.read())
+    except urllib.error.URLError as e:
+        print(f"  ERROR: request to Plano at {url} failed: {e}", file=sys.stderr)
+        print("  Is the demo running? Start it with: ./run_demo.sh up", file=sys.stderr)
+        sys.exit(1)
+
+
+def extract_model(response: dict) -> str:
+    return response.get("model", "<unknown>")
+
+
+def run_verification(url: str, rounds: int, num_sessions: int) -> bool:
+    print(f"Plano endpoint : {url}")
+    print(f"Sessions       : {num_sessions}")
+    print(f"Rounds/session : {rounds}")
+    print()
+
+    all_passed = True
+
+    # --- Phase 1: Requests without session ID ---
+    print("=" * 60)
+    print("Phase 1: Requests WITHOUT X-Model-Affinity header")
+    print("  (model may vary between requests — that is expected)")
+    print("=" * 60)
+    models_seen: set[str] = set()
+    for i in range(min(rounds, 3)):
+        resp = chat(url, None, PROMPTS[i % len(PROMPTS)])
+        model = extract_model(resp)
+        models_seen.add(model)
+        print(f"  Request {i + 1}: model = {model}")
+    print(f"  Models seen across {min(rounds, 3)} requests: {models_seen}")
+    print()
+
+    # --- Phase 2: Each session should always get the same model ---
+    print("=" * 60)
+    print("Phase 2: Requests WITH X-Model-Affinity (session pinning)")
+    print("  Each session should be pinned to exactly one model.")
+    print("=" * 60)
+
+    session_results: dict[str, list[str]] = defaultdict(list)
+
+    for s in range(num_sessions):
+        session_id = f"demo-session-{s + 1:03d}"
+        print(f"\n  Session '{session_id}':")
+
+        for r in range(rounds):
+            resp = chat(url, session_id, PROMPTS[r % len(PROMPTS)])
+            model = extract_model(resp)
+            session_results[session_id].append(model)
+            pinned = " [PINNED]" if r > 0 else " [FIRST — sets affinity]"
+            print(f"    Round {r + 1}: model = {model}{pinned}")
+
+    print()
+    print("=" * 60)
+    print("Results")
+    print("=" * 60)
+
+    for session_id, models in session_results.items():
+        unique_models = set(models)
+        if len(unique_models) == 1:
+            print(f"  PASS  {session_id} -> always routed to '{models[0]}'")
+        else:
+            print(
+                f"  FAIL  {session_id} -> inconsistent models across rounds: {unique_models}"
+            )
+            all_passed = False
+
+    print()
+    if all_passed:
+        print("All sessions were pinned consistently.")
+        print("Redis session cache is working correctly.")
+    else:
+        print("One or more sessions were NOT pinned consistently.")
+        print("Check that Redis is running and Plano is configured with:")
+        print("  routing:")
+        print("    session_cache:")
+        print("      type: redis")
+        print("      url: redis://localhost:6379")
+
+    return all_passed
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--url", default=PLANO_URL, help="Plano chat completions URL")
+    parser.add_argument(
+        "--rounds", type=int, default=4, help="Requests per session (default 4)"
+    )
+    parser.add_argument(
+        "--sessions", type=int, default=3, help="Number of sessions to test (default 3)"
+    )
+    args = parser.parse_args()
+
+    passed = run_verification(args.url, args.rounds, args.sessions)
+    sys.exit(0 if passed else 1)
+
+
+if __name__ == "__main__":
+    main()
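
The `except urllib.error.URLError` clause in `chat()` also catches `urllib.error.HTTPError`, because `HTTPError` is a subclass of `URLError`. A 4xx or 5xx reply from Plano therefore lands in the same branch as a refused connection. The sketch below shows one way to tell the two apart; `describe_failure` is a hypothetical helper and is not part of the script.

```python
import urllib.error


def describe_failure(url: str, exc: urllib.error.URLError) -> str:
    """Turn a urllib exception into a message that names the actual failure."""
    # HTTPError must be tested first because it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"Plano at {url} answered with HTTP {exc.code}: {exc.reason}"
    return f"could not reach Plano at {url}: {exc.reason}"


if __name__ == "__main__":
    endpoint = "http://localhost:12000/v1/chat/completions"
    # Simulated failures; no network access is needed to run this.
    print(describe_failure(endpoint, urllib.error.URLError("connection refused")))
    print(
        describe_failure(
            endpoint,
            urllib.error.HTTPError(endpoint, 502, "Bad Gateway", None, None),
        )
    )
```

With a split like this, the hint to start the demo would only be shown when the endpoint is actually unreachable.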