mirror of https://github.com/katanemo/plano.git (synced 2026-04-25 00:36:34 +02:00)

remove session_affinity_redis and session_affinity_redis_k8s demos

This commit is contained in: parent 03cb09f47e, commit 05a85e9a85
20 changed files with 0 additions and 2080 deletions
@ -1 +0,0 @@
OPENAI_API_KEY=sk-replace-me

@ -1,247 +0,0 @@
# Session Affinity with Redis — Multi-Replica Model Pinning

This demo shows Plano's **session affinity** (`X-Model-Affinity` header) backed by a **Redis session cache** instead of the default in-memory store.

## The Problem

By default, model affinity stores routing decisions in a per-process `HashMap`.
This works for single-instance deployments, but breaks when you run multiple
Plano replicas behind a load balancer:

```
Client ──► Load Balancer ──► Replica A (session pinned here)
                        └──► Replica B (knows nothing about the session)
```

A request that was pinned to `gpt-4o` on Replica A will be re-routed from
scratch on Replica B, defeating the purpose of affinity.

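The failure mode is easy to reproduce in miniature. The sketch below is a toy model, not Plano's code: each replica keeps a private pin table, and the stand-in "router" simply cycles through models, so the two replicas disagree as soon as the load balancer alternates between them.

```python
import itertools

# Stand-in "router": cycles through models, so two independent routing
# decisions for the same session are not guaranteed to agree.
router = itertools.cycle(["openai/gpt-4o-mini", "openai/gpt-5.2"])

class Replica:
    """A replica with its own in-process pin table (the default HashMap)."""
    def __init__(self) -> None:
        self.pins: dict[str, str] = {}

    def route(self, session_id: str) -> str:
        if session_id not in self.pins:        # cache miss: ask the router
            self.pins[session_id] = next(router)
        return self.pins[session_id]

a, b = Replica(), Replica()
# The load balancer alternates replicas; each one pins independently.
models = [replica.route("my-session") for replica in (a, b, a, b)]
print(models)
# ['openai/gpt-4o-mini', 'openai/gpt-5.2', 'openai/gpt-4o-mini', 'openai/gpt-5.2']
```

Replica B never sees Replica A's pin, so the session flips between models on every other request, which is exactly what a shared backend prevents.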
## The Solution

Plano's `session_cache` config key accepts a `type: redis` backend that is
shared across all replicas:

```yaml
routing:
  session_ttl_seconds: 300
  session_cache:
    type: redis
    url: redis://localhost:6379
```

All replicas read and write the same Redis keyspace. A session pinned on any
replica is immediately visible to all others.

## What to Look For

| What | Expected behaviour |
|------|--------------------|
| First request with a session ID | Plano routes normally (via Arch-Router) and writes the result to Redis (`SET session-id ... EX 300`) |
| Subsequent requests with the **same** session ID | Plano reads from Redis and skips the router — same model every time |
| Requests with a **different** session ID | Routed independently; may land on a different model |
| After `session_ttl_seconds` elapses | Redis key expires; next request re-routes and sets a new pin |
| `x-plano-pinned: true` response header | Tells you the response was served from the session cache |

## Architecture

```
Client
   │  X-Model-Affinity: my-session-id
   ▼
Plano (brightstaff)
   ├── GET redis://localhost:6379/my-session-id
   │      hit?  → return pinned model immediately (no Arch-Router call)
   │      miss? → call Arch-Router → SET key EX 300 → return routed model
   ▼
Redis (shared across replicas)
```

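The hit/miss flow described above is a cache-aside lookup with a TTL. Here is a minimal sketch of that pattern in Python; a plain dict and a fake clock stand in for Redis and its `EX` expiry, and `fake_router` is a hypothetical stand-in for the Arch-Router call. None of these names come from Plano's code.

```python
SESSION_TTL_SECONDS = 300

# key -> (model, expiry time); a dict standing in for Redis
store: dict[str, tuple[str, float]] = {}
clock = {"now": 0.0}  # fake clock so TTL expiry is easy to simulate

def route_with_affinity(session_id: str, call_router) -> str:
    entry = store.get(session_id)
    if entry is not None and entry[1] > clock["now"]:  # hit: return pin, skip router
        return entry[0]
    model = call_router()                              # miss: route from scratch
    store[session_id] = (model, clock["now"] + SESSION_TTL_SECONDS)  # like SET ... EX 300
    return model

router_calls = []
def fake_router() -> str:
    router_calls.append(1)
    return "openai/gpt-5.2"

first = route_with_affinity("my-session-abc", fake_router)
second = route_with_affinity("my-session-abc", fake_router)
clock["now"] = 301                                     # TTL elapses
third = route_with_affinity("my-session-abc", fake_router)
print(first, second, len(router_calls))
# openai/gpt-5.2 openai/gpt-5.2 2
```

Only the first request per TTL window pays for a router call; every other request is a single cache read, which is why pinned requests are faster.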
## Prerequisites

| Requirement | Notes |
|-------------|-------|
| `planoai` CLI | `pip install planoai` |
| Docker + Docker Compose | For Redis and Jaeger |
| `OPENAI_API_KEY` | Required for routing model (Arch-Router) and downstream LLMs |
| Python 3.11+ | Only needed to run `verify_affinity.py` |

## Quick Start

```bash
# 1. Set your API key
export OPENAI_API_KEY=sk-...
# or copy and edit:
cp .env.example .env

# 2. Start Redis, Jaeger, and Plano
./run_demo.sh up

# 3. Verify session pinning works
python verify_affinity.py
```

## Manual Verification with curl

### Step 1 — Pin a session (first request sets the affinity)

```bash
curl -s http://localhost:12000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-model-affinity: my-session-abc" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Write a short poem about the ocean."}]}' \
  | jq '{model, pinned: .x_plano_pinned}'
```

Expected output (first request — not yet pinned, Arch-Router picks the model):

```json
{
  "model": "openai/gpt-5.2",
  "pinned": null
}
```

### Step 2 — Confirm the pin is held on subsequent requests

```bash
for i in 1 2 3 4; do
  curl -s http://localhost:12000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "x-model-affinity: my-session-abc" \
    -d "{\"model\":\"openai/gpt-4o-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Request $i\"}]}" \
    | jq -r '"\(.model)"'
done
```

Expected output (same model for every request):

```
openai/gpt-5.2
openai/gpt-5.2
openai/gpt-5.2
openai/gpt-5.2
```

### Step 3 — Inspect the Redis key directly

```bash
docker exec plano-session-redis redis-cli \
  GET my-session-abc | python3 -m json.tool
```

Expected output:

```json
{
    "model_name": "openai/gpt-5.2",
    "route_name": "deep_reasoning"
}
```

```bash
# Check the TTL (seconds remaining)
docker exec plano-session-redis redis-cli TTL my-session-abc
# e.g. 287
```

### Step 4 — Different sessions may get different models

```bash
for session in session-A session-B session-C; do
  model=$(curl -s http://localhost:12000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "x-model-affinity: $session" \
    -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Explain quantum entanglement in detail with equations."}]}' \
    | jq -r '.model')
  echo "$session -> $model"
done
```

Sessions with content matched to `deep_reasoning` will pin to `openai/gpt-5.2`;
sessions matched to `fast_responses` will pin to `openai/gpt-4o-mini`.

## Verification Script Output

Running `python verify_affinity.py` produces output like:

```
Plano endpoint : http://localhost:12000/v1/chat/completions
Sessions       : 3
Rounds/session : 4

============================================================
Phase 1: Requests WITHOUT X-Model-Affinity header
         (model may vary between requests — that is expected)
============================================================
  Request 1: model = openai/gpt-4o-mini
  Request 2: model = openai/gpt-5.2
  Request 3: model = openai/gpt-4o-mini
  Models seen across 3 requests: {'openai/gpt-4o-mini', 'openai/gpt-5.2'}

============================================================
Phase 2: Requests WITH X-Model-Affinity (session pinning)
         Each session should be pinned to exactly one model.
============================================================

  Session 'demo-session-001':
    Round 1: model = openai/gpt-4o-mini  [FIRST — sets affinity]
    Round 2: model = openai/gpt-4o-mini  [PINNED]
    Round 3: model = openai/gpt-4o-mini  [PINNED]
    Round 4: model = openai/gpt-4o-mini  [PINNED]

  Session 'demo-session-002':
    Round 1: model = openai/gpt-5.2  [FIRST — sets affinity]
    Round 2: model = openai/gpt-5.2  [PINNED]
    Round 3: model = openai/gpt-5.2  [PINNED]
    Round 4: model = openai/gpt-5.2  [PINNED]

  Session 'demo-session-003':
    Round 1: model = openai/gpt-4o-mini  [FIRST — sets affinity]
    Round 2: model = openai/gpt-4o-mini  [PINNED]
    Round 3: model = openai/gpt-4o-mini  [PINNED]
    Round 4: model = openai/gpt-4o-mini  [PINNED]

============================================================
Results
============================================================
  PASS  demo-session-001 -> always routed to 'openai/gpt-4o-mini'
  PASS  demo-session-002 -> always routed to 'openai/gpt-5.2'
  PASS  demo-session-003 -> always routed to 'openai/gpt-4o-mini'

All sessions were pinned consistently.
Redis session cache is working correctly.
```

## Observability

Open Jaeger at **http://localhost:16686** and select service `plano`.

- Requests **without** affinity: look for a span to the Arch-Router service
- Requests **with** affinity (pinned): the Arch-Router span will be absent —
  the decision was served from Redis without calling the router at all

This is the clearest observable signal that the cache is working: pinned
requests are noticeably faster and produce fewer spans.

## Switching to the In-Memory Backend

To compare against the default in-memory backend, change `config.yaml`:

```yaml
routing:
  session_ttl_seconds: 300
  session_cache:
    type: memory   # ← change this
```

In-memory mode does **not** require Redis and works identically for a
single Plano process. The difference only becomes visible when you run
multiple replicas.

## Teardown

```bash
./run_demo.sh down
```

This stops Plano, Redis, and Jaeger.

@ -1,36 +0,0 @@
version: v0.4.0

listeners:
  - type: model
    name: model_listener
    port: 12000

model_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true

  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY

routing_preferences:
  - name: fast_responses
    description: short factual questions, quick lookups, simple summarization, or greetings
    models:
      - openai/gpt-4o-mini

  - name: deep_reasoning
    description: multi-step reasoning, complex analysis, code review, or detailed explanations
    models:
      - openai/gpt-5.2
      - openai/gpt-4o-mini

routing:
  session_ttl_seconds: 300
  session_cache:
    type: redis
    url: redis://localhost:6379

tracing:
  random_sampling: 100
  trace_arch_internal: true

@ -1,23 +0,0 @@
services:
  redis:
    image: redis:7-alpine
    container_name: plano-session-redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    command: redis-server --save "" --appendonly no
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 1s
      timeout: 1s
      retries: 10

  jaeger:
    build:
      context: ../../shared/jaeger
    container_name: plano-session-jaeger
    restart: unless-stopped
    ports:
      - "16686:16686"
      - "4317:4317"
      - "4318:4318"

@ -1,94 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail

DEMO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

load_env() {
  if [ -f "$DEMO_DIR/.env" ]; then
    set -a
    # shellcheck disable=SC1091
    source "$DEMO_DIR/.env"
    set +a
  fi
}

check_prereqs() {
  local missing=()
  command -v docker >/dev/null 2>&1 || missing+=("docker")
  command -v planoai >/dev/null 2>&1 || missing+=("planoai (pip install planoai)")
  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: missing required tools: ${missing[*]}"
    exit 1
  fi

  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "ERROR: OPENAI_API_KEY is not set."
    echo "       Create a .env file or export the variable before running."
    exit 1
  fi
}

start_demo() {
  echo "==> Starting Redis + Jaeger..."
  docker compose -f "$DEMO_DIR/docker-compose.yaml" up -d

  echo "==> Waiting for Redis to be ready..."
  local retries=0
  until docker exec plano-session-redis redis-cli ping 2>/dev/null | grep -q PONG; do
    retries=$((retries + 1))
    if [ $retries -ge 15 ]; then
      echo "ERROR: Redis did not become ready in time"
      exit 1
    fi
    sleep 1
  done
  echo "    Redis is ready."

  echo "==> Starting Plano..."
  planoai up "$DEMO_DIR/config.yaml"

  echo ""
  echo "Demo is running!"
  echo ""
  echo "  Model endpoint: http://localhost:12000/v1/chat/completions"
  echo "  Jaeger UI:      http://localhost:16686"
  echo "  Redis:          localhost:6379"
  echo ""
  echo "Run the verification script to confirm session pinning:"
  echo "  python $DEMO_DIR/verify_affinity.py"
  echo ""
  echo "Stop the demo with: $0 down"
}

stop_demo() {
  echo "==> Stopping Plano..."
  planoai down 2>/dev/null || true

  echo "==> Stopping Docker services..."
  docker compose -f "$DEMO_DIR/docker-compose.yaml" down

  echo "Demo stopped."
}

usage() {
  echo "Usage: $0 [up|down]"
  echo ""
  echo "  up     Start Redis, Jaeger, and Plano (default)"
  echo "  down   Stop all services"
}

load_env

case "${1:-up}" in
  up)
    check_prereqs
    start_demo
    ;;
  down)
    stop_demo
    ;;
  *)
    usage
    exit 1
    ;;
esac

@ -1,146 +0,0 @@
#!/usr/bin/env python3
"""
verify_affinity.py — Verify that model affinity (session pinning) works correctly.

Sends multiple requests with the same X-Model-Affinity session ID and asserts
that every response is served by the same model, demonstrating that Plano's
session cache is working as expected.

Usage:
    python verify_affinity.py [--url URL] [--rounds N] [--sessions N]
"""

import argparse
import json
import sys
import urllib.error
import urllib.request
from collections import defaultdict

PLANO_URL = "http://localhost:12000/v1/chat/completions"

PROMPTS = [
    "What is 2 + 2?",
    "Name the capital of France.",
    "How many days in a week?",
    "What color is the sky?",
    "Who wrote Romeo and Juliet?",
]

MESSAGES_PER_SESSION = [{"role": "user", "content": prompt} for prompt in PROMPTS]


def chat(url: str, session_id: str | None, message: str) -> dict:
    payload = json.dumps(
        {
            "model": "openai/gpt-4o-mini",
            "messages": [{"role": "user", "content": message}],
        }
    ).encode()

    headers = {"Content-Type": "application/json"}
    if session_id:
        headers["x-model-affinity"] = session_id

    req = urllib.request.Request(url, data=payload, headers=headers, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())
    except urllib.error.URLError as e:
        print(f"  ERROR: could not reach Plano at {url}: {e}", file=sys.stderr)
        print("  Is the demo running? Start it with: ./run_demo.sh up", file=sys.stderr)
        sys.exit(1)


def extract_model(response: dict) -> str:
    return response.get("model", "<unknown>")


def run_verification(url: str, rounds: int, num_sessions: int) -> bool:
    print(f"Plano endpoint : {url}")
    print(f"Sessions       : {num_sessions}")
    print(f"Rounds/session : {rounds}")
    print()

    all_passed = True

    # --- Phase 1: Requests without session ID ---
    print("=" * 60)
    print("Phase 1: Requests WITHOUT X-Model-Affinity header")
    print("         (model may vary between requests — that is expected)")
    print("=" * 60)
    models_seen: set[str] = set()
    for i in range(min(rounds, 3)):
        resp = chat(url, None, PROMPTS[i % len(PROMPTS)])
        model = extract_model(resp)
        models_seen.add(model)
        print(f"  Request {i + 1}: model = {model}")
    print(f"  Models seen across {min(rounds, 3)} requests: {models_seen}")
    print()

    # --- Phase 2: Each session should always get the same model ---
    print("=" * 60)
    print("Phase 2: Requests WITH X-Model-Affinity (session pinning)")
    print("         Each session should be pinned to exactly one model.")
    print("=" * 60)

    session_results: dict[str, list[str]] = defaultdict(list)

    for s in range(num_sessions):
        session_id = f"demo-session-{s + 1:03d}"
        print(f"\n  Session '{session_id}':")

        for r in range(rounds):
            resp = chat(url, session_id, PROMPTS[r % len(PROMPTS)])
            model = extract_model(resp)
            session_results[session_id].append(model)
            pinned = "  [PINNED]" if r > 0 else "  [FIRST — sets affinity]"
            print(f"    Round {r + 1}: model = {model}{pinned}")

    print()
    print("=" * 60)
    print("Results")
    print("=" * 60)

    for session_id, models in session_results.items():
        unique_models = set(models)
        if len(unique_models) == 1:
            print(f"  PASS  {session_id} -> always routed to '{models[0]}'")
        else:
            print(
                f"  FAIL  {session_id} -> inconsistent models across rounds: {unique_models}"
            )
            all_passed = False

    print()
    if all_passed:
        print("All sessions were pinned consistently.")
        print("Redis session cache is working correctly.")
    else:
        print("One or more sessions were NOT pinned consistently.")
        print("Check that Redis is running and Plano is configured with:")
        print("  routing:")
        print("    session_cache:")
        print("      type: redis")
        print("      url: redis://localhost:6379")

    return all_passed


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--url", default=PLANO_URL, help="Plano chat completions URL")
    parser.add_argument(
        "--rounds", type=int, default=4, help="Requests per session (default 4)"
    )
    parser.add_argument(
        "--sessions", type=int, default=3, help="Number of sessions to test (default 3)"
    )
    args = parser.parse_args()

    passed = run_verification(args.url, args.rounds, args.sessions)
    sys.exit(0 if passed else 1)


if __name__ == "__main__":
    main()

@ -1 +0,0 @@
OPENAI_API_KEY=sk-replace-me

@ -1,95 +0,0 @@
# Plano image for Redis-backed session affinity demo.
# Build context must be the repository root:
#   docker build -f demos/llm_routing/session_affinity_redis_k8s/Dockerfile -t <your-image> .

# Envoy version — keep in sync with cli/planoai/consts.py ENVOY_VERSION
ARG ENVOY_VERSION=v1.37.0

# --- Dependency cache ---
FROM rust:1.93.0 AS deps
RUN rustup -v target add wasm32-wasip1
WORKDIR /arch

COPY crates/Cargo.toml crates/Cargo.lock ./
COPY crates/common/Cargo.toml common/Cargo.toml
COPY crates/hermesllm/Cargo.toml hermesllm/Cargo.toml
COPY crates/prompt_gateway/Cargo.toml prompt_gateway/Cargo.toml
COPY crates/llm_gateway/Cargo.toml llm_gateway/Cargo.toml
COPY crates/brightstaff/Cargo.toml brightstaff/Cargo.toml

RUN mkdir -p common/src && echo "" > common/src/lib.rs && \
    mkdir -p hermesllm/src && echo "" > hermesllm/src/lib.rs && \
    mkdir -p hermesllm/src/bin && echo "fn main() {}" > hermesllm/src/bin/fetch_models.rs && \
    mkdir -p prompt_gateway/src && echo "#[no_mangle] pub fn _start() {}" > prompt_gateway/src/lib.rs && \
    mkdir -p llm_gateway/src && echo "#[no_mangle] pub fn _start() {}" > llm_gateway/src/lib.rs && \
    mkdir -p brightstaff/src && echo "fn main() {}" > brightstaff/src/main.rs && echo "" > brightstaff/src/lib.rs

RUN cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway || true
RUN cargo build --release -p brightstaff || true

# --- WASM plugins ---
FROM deps AS wasm-builder
RUN rm -rf common/src hermesllm/src prompt_gateway/src llm_gateway/src
COPY crates/common/src common/src
COPY crates/hermesllm/src hermesllm/src
COPY crates/prompt_gateway/src prompt_gateway/src
COPY crates/llm_gateway/src llm_gateway/src
RUN find common hermesllm prompt_gateway llm_gateway -name "*.rs" -exec touch {} +
RUN cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway

# --- Brightstaff binary ---
FROM deps AS brightstaff-builder
RUN rm -rf common/src hermesllm/src brightstaff/src
COPY crates/common/src common/src
COPY crates/hermesllm/src hermesllm/src
COPY crates/brightstaff/src brightstaff/src
RUN find common hermesllm brightstaff -name "*.rs" -exec touch {} +
RUN cargo build --release -p brightstaff

FROM docker.io/envoyproxy/envoy:${ENVOY_VERSION} AS envoy

FROM python:3.14-slim AS arch

RUN set -eux; \
    apt-get update; \
    apt-get upgrade -y; \
    apt-get install -y --no-install-recommends gettext-base curl procps; \
    apt-get clean; rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir supervisor

RUN set -eux; \
    dpkg -r --force-depends libpam-modules libpam-modules-bin libpam-runtime libpam0g || true; \
    dpkg -P --force-all libpam-modules libpam-modules-bin libpam-runtime libpam0g || true; \
    rm -rf /etc/pam.d /lib/*/security /usr/lib/security || true

COPY --from=envoy /usr/local/bin/envoy /usr/local/bin/envoy

WORKDIR /app

RUN pip install --no-cache-dir uv

COPY cli/pyproject.toml ./
COPY cli/uv.lock ./
COPY cli/README.md ./
COPY config/plano_config_schema.yaml /config/plano_config_schema.yaml
COPY config/envoy.template.yaml /config/envoy.template.yaml

RUN pip install --no-cache-dir -e .

COPY cli/planoai planoai/
COPY config/envoy.template.yaml .
COPY config/plano_config_schema.yaml .
RUN mkdir -p /etc/supervisor/conf.d
COPY config/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

COPY --from=wasm-builder /arch/target/wasm32-wasip1/release/prompt_gateway.wasm /etc/envoy/proxy-wasm-plugins/prompt_gateway.wasm
COPY --from=wasm-builder /arch/target/wasm32-wasip1/release/llm_gateway.wasm /etc/envoy/proxy-wasm-plugins/llm_gateway.wasm
COPY --from=brightstaff-builder /arch/target/release/brightstaff /app/brightstaff

RUN mkdir -p /var/log/supervisor && \
    touch /var/log/envoy.log /var/log/supervisor/supervisord.log \
          /var/log/access_ingress.log /var/log/access_ingress_prompt.log \
          /var/log/access_internal.log /var/log/access_llm.log /var/log/access_agent.log

ENTRYPOINT ["/usr/local/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]

@ -1,287 +0,0 @@
# Session Affinity — Multi-Replica Kubernetes Deployment

Production-style Kubernetes demo that proves Redis-backed session affinity
(`X-Model-Affinity`) works correctly when Plano runs as multiple replicas
behind a load balancer.

## Architecture

```
                  ┌──────────────────────────────────────────────┐
                  │              Kubernetes Cluster              │
                  │                                              │
Client ──────────►│    LoadBalancer Service (port 12000)         │
                  │         │               │                    │
                  │    ┌────▼────┐     ┌────▼────┐               │
                  │    │  Plano  │     │  Plano  │ (replicas)    │
                  │    │  Pod 0  │     │  Pod 1  │               │
                  │    └────┬────┘     └────┬────┘               │
                  │         └───────┬───────┘                    │
                  │            ┌────▼────┐                       │
                  │            │  Redis  │  (StatefulSet)        │
                  │            │   Pod   │  shared session store │
                  │            └─────────┘                       │
                  │                                              │
                  │            ┌──────────┐                      │
                  │            │  Jaeger  │  distributed tracing │
                  │            └──────────┘                      │
                  └──────────────────────────────────────────────┘
```

**What makes this production-like:**

| Feature | Detail |
|---------|--------|
| 2 Plano replicas | `replicas: 2` with HPA (scales 2–5 on CPU) |
| Shared Redis | StatefulSet with PVC — sessions survive pod restarts |
| Session TTL | 600 s, enforced natively by Redis `EX` |
| Eviction policy | `allkeys-lru` — Redis auto-evicts oldest sessions under memory pressure |
| Distributed tracing | Jaeger collects spans from both pods |
| Health probes | Readiness + liveness probes gate traffic away from unhealthy pods |

## Quick Start (local — no registry needed)

```bash
# 1. Install kind if needed
#    https://kind.sigs.k8s.io/docs/user/quick-start/#installation
#    brew install kind  (macOS)

# 2. Set your API key
export OPENAI_API_KEY=sk-...
# or copy and edit:
cp .env.example .env

# 3. Build, deploy, and verify in one command
./run-local.sh
```

`run-local.sh` creates a kind cluster named `plano-demo` (if it doesn't exist),
builds the image locally, and loads it into the cluster with `kind load docker-image`
— **no registry, no push required**.

Individual steps:

```bash
./run-local.sh --build-only      # (re-)build and reload image into kind
./run-local.sh --deploy-only     # (re-)apply k8s manifests
./run-local.sh --verify          # run verify_affinity.py
./run-local.sh --down            # delete k8s resources (keeps kind cluster)
./run-local.sh --delete-cluster  # delete k8s resources + kind cluster
```

---

## Prerequisites

| Tool | Notes |
|------|-------|
| `kubectl` | Configured to reach a Kubernetes cluster |
| `docker` | To build and push the custom image |
| Container registry (optional) | Needed only when you are not using the local kind flow |
| `OPENAI_API_KEY` | For model inference |
| Python 3.11+ | Only for `verify_affinity.py` |

**Cluster:** `run-local.sh` creates and manages a kind cluster named `plano-demo` automatically. Install kind from https://kind.sigs.k8s.io or `brew install kind`.

## Step 1 — Build the Image

Build a custom image from the repo root:

```bash
# From this demo directory:
./build-and-push.sh ghcr.io/yourorg/plano-redis:latest

# Or manually from the repo root:
docker build \
  -f demos/llm_routing/session_affinity_redis_k8s/Dockerfile \
  -t ghcr.io/yourorg/plano-redis:latest \
  .
docker push ghcr.io/yourorg/plano-redis:latest
```

Then update the image reference in `k8s/plano.yaml` (skip this when using `run-local.sh`, which uses `plano-redis:local` automatically):

```yaml
image: ghcr.io/yourorg/plano-redis:latest  # ← replace YOUR_REGISTRY/plano-redis:latest
```

## Step 2 — Deploy

```bash
./deploy.sh
```

The script:
1. Creates the `plano-demo` namespace
2. Prompts for `OPENAI_API_KEY` and creates a Kubernetes Secret
3. Applies Redis, Jaeger, ConfigMap, and Plano manifests in order
4. Waits for rollouts to complete

Expected output:

```
==> Applying namespace...
==> Creating API key secret...
    OPENAI_API_KEY: [hidden]
==> Applying Redis (StatefulSet + Services)...
==> Applying Jaeger...
==> Applying Plano config (ConfigMap)...
==> Applying Plano deployment + HPA...
==> Waiting for Redis to be ready...
==> Waiting for Plano pods to be ready...

Deployment complete!

=== Pods ===
NAME                  READY   STATUS    NODE
redis-0               1/1     Running   node-1
plano-6d8f9b-xk2pq    1/1     Running   node-1
plano-6d8f9b-r7nlw    1/1     Running   node-2
jaeger-5c7d8f-q9mnb   1/1     Running   node-1

=== Services ===
NAME     TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)
plano    LoadBalancer   10.96.12.50   203.0.113.42   12000:32000/TCP
redis    ClusterIP      None          <none>         6379/TCP
jaeger   ClusterIP      10.96.8.71    <none>         16686/TCP,...
```

## Step 3 — Verify Session Affinity Across Replicas

```bash
python verify_affinity.py
```

The script opens a dedicated `kubectl port-forward` tunnel to **each pod
individually**. This is the definitive test: it routes requests to specific
pods rather than relying on random load-balancer assignment.

```
Mode: per-pod port-forward (full cross-replica proof)

Found 2 Plano pod(s): plano-6d8f9b-xk2pq, plano-6d8f9b-r7nlw
Opening per-pod port-forward tunnels...

  plano-6d8f9b-xk2pq → localhost:19100
  plano-6d8f9b-r7nlw → localhost:19101

==================================================================
Phase 1: Cross-replica session pinning
  Pods under test : plano-6d8f9b-xk2pq, plano-6d8f9b-r7nlw
  Sessions        : 4
  Rounds/session  : 4

  Each session is PINNED via one pod and VERIFIED via another.
  If Redis is shared, every round must return the same model.
==================================================================

  PASS  k8s-session-001
        model     : gpt-4o-mini-2024-07-18
        pod order : plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw

  PASS  k8s-session-002
        model     : gpt-5.2
        pod order : plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq

  PASS  k8s-session-003
        model     : gpt-4o-mini-2024-07-18
        pod order : plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw

  PASS  k8s-session-004
        model     : gpt-5.2
        pod order : plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq

==================================================================
Phase 2: Redis key inspection
==================================================================
  k8s-session-001
    model_name : gpt-4o-mini-2024-07-18
    route_name : fast_responses
    TTL        : 587s remaining
  k8s-session-002
    model_name : gpt-5.2
    route_name : deep_reasoning
    TTL        : 581s remaining
  ...

==================================================================
Summary
==================================================================
All sessions were pinned consistently across replicas.
Redis session cache is working correctly in Kubernetes.
```

## What to Look For

### The cross-replica proof

Each session's `pod order` line shows it alternating between the two pods:

```
pod order: pod-A → pod-B → pod-A → pod-B
```

Round 1 sets the Redis key (via pod-A). Rounds 2, 3, 4 read from Redis on
alternating pods. If the model stays the same across all rounds, Redis is the
shared source of truth — **not** any in-process state.

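The same argument can be modeled in a few lines of Python (a toy sketch, not the demo's code): both "pods" consult one shared store, and the stand-in router would pick a different model on every fresh call, so getting a single model across alternating pods can only come from the shared store.

```python
shared: dict[str, str] = {}  # one store for every pod; stands in for the Redis StatefulSet
fresh = iter(["gpt-5.2", "gpt-4o-mini-2024-07-18"])  # stand-in router: differs on each fresh call

def make_pod(name: str):
    """A pod has its own identity but no private pin table."""
    def handle(session_id: str) -> tuple[str, str]:
        if session_id not in shared:           # miss: only round 1 routes
            shared[session_id] = next(fresh)   # SET via whichever pod got round 1
        return name, shared[session_id]
    return handle

pod_a, pod_b = make_pod("plano-pod-a"), make_pod("plano-pod-b")

# Round 1 pins via pod A; rounds 2-4 alternate pods, as the verifier does.
rounds = [pod("k8s-session-001") for pod in (pod_a, pod_b, pod_a, pod_b)]
models = {model for _, model in rounds}
print(models)  # exactly one model across both pods
```

If each pod instead had its own dict, round 2 on pod B would draw a fresh model from the router and the set would contain two entries, which is the failure the shared backend exists to prevent.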
### Redis keys

```bash
kubectl exec -it redis-0 -n plano-demo -- redis-cli

127.0.0.1:6379> KEYS *
1) "k8s-session-001"
2) "k8s-session-002"

127.0.0.1:6379> GET k8s-session-001
{"model_name":"gpt-4o-mini-2024-07-18","route_name":"fast_responses"}

127.0.0.1:6379> TTL k8s-session-001
(integer) 587
```

### Jaeger traces

```bash
kubectl port-forward svc/jaeger 16686:16686 -n plano-demo
```

Open **http://localhost:16686**, select service `plano`.

- **Pinned requests** — no span to the Arch-Router (decision served from Redis)
- **First request** per session — spans include the router call + a Redis `SET`
- Both Plano pods appear as separate instances in the trace list

### Scaling up (HPA in action)

```bash
# Scale to 3 replicas manually
kubectl scale deployment/plano --replicas=3 -n plano-demo

# Run verification again — now 3 pods alternate
python verify_affinity.py --sessions 6
```

Existing sessions in Redis are unaffected by the scale event. New pods
immediately participate in the shared session pool.

## Teardown

```bash
./deploy.sh --destroy
# Then optionally:
kubectl delete namespace plano-demo
```

## Notes

- The Redis StatefulSet uses a `PersistentVolumeClaim`. Session data survives
  pod restarts within a TTL window but is not HA. For production, replace with
  Redis Sentinel, Redis Cluster, or a managed service (ElastiCache, MemoryStore).
- `session_max_entries` is not enforced by this backend — Redis uses
  `maxmemory-policy: allkeys-lru` instead, which is a global limit across all
  keys rather than a per-application cap.
- On **minikube**, run `minikube tunnel` in a separate terminal to get an
  external IP for the LoadBalancer service.
- On **kind**, switch to `NodePort` (see the comment in `k8s/plano.yaml`).

@ -1,47 +0,0 @@
#!/usr/bin/env bash
# build-and-push.sh — Build the Plano demo image and push it to your registry.
#
# Usage:
#   ./build-and-push.sh <registry/image:tag>
#
# Example:
#   ./build-and-push.sh ghcr.io/yourorg/plano-redis:latest
#   ./build-and-push.sh docker.io/youruser/plano-redis:0.4.17
#
# The build context is the repository root. Run this script from anywhere —
# it resolves the repo root automatically.

set -euo pipefail

IMAGE="${1:-}"
if [ -z "$IMAGE" ]; then
  echo "Usage: $0 <registry/image:tag>"
  echo ""
  echo "Example:"
  echo "  $0 ghcr.io/yourorg/plano-redis:latest"
  exit 1
fi

REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
DOCKERFILE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/Dockerfile"

echo "Repository root : $REPO_ROOT"
echo "Dockerfile      : $DOCKERFILE"
echo "Image           : $IMAGE"
echo ""

echo "==> Building image (this takes a few minutes — Rust compile from scratch)..."
docker build \
  --file "$DOCKERFILE" \
  --tag "$IMAGE" \
  --progress=plain \
  "$REPO_ROOT"

echo "==> Pushing $IMAGE..."
docker push "$IMAGE"

echo ""
echo "Done. Update k8s/plano.yaml:"
echo "  image: $IMAGE"
echo ""
echo "Then deploy with: ./deploy.sh"
@ -1,38 +0,0 @@
version: v0.4.0

listeners:
  - type: model
    name: model_listener
    port: 12000

model_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true

  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY

routing_preferences:
  - name: fast_responses
    description: short factual questions, quick lookups, simple summarization, or greetings
    models:
      - openai/gpt-4o-mini

  - name: deep_reasoning
    description: multi-step reasoning, complex analysis, code review, or detailed explanations
    models:
      - openai/gpt-5.2
      - openai/gpt-4o-mini

# Redis is reachable inside the cluster via the service name.
routing:
  session_ttl_seconds: 600
  session_cache:
    type: redis
    url: redis://redis.plano-demo.svc.cluster.local:6379

tracing:
  random_sampling: 100
  trace_arch_internal: true
  opentracing_grpc_endpoint: http://jaeger.plano-demo.svc.cluster.local:4317
@ -1,136 +0,0 @@
#!/usr/bin/env bash
# deploy.sh — Apply all Kubernetes manifests in the correct order.
#
# Usage:
#   ./deploy.sh            # deploy everything
#   ./deploy.sh --destroy  # tear down everything (keeps namespace for safety)
#   ./deploy.sh --status   # show pod and service status

set -euo pipefail

DEMO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
K8S_DIR="$DEMO_DIR/k8s"
NS="plano-demo"

check_prereqs() {
  local missing=()
  command -v kubectl >/dev/null 2>&1 || missing+=("kubectl")
  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: missing required tools: ${missing[*]}"
    exit 1
  fi

  if ! kubectl cluster-info &>/dev/null; then
    echo "ERROR: no Kubernetes cluster reachable. Start minikube/kind or configure kubeconfig."
    exit 1
  fi
}

create_secret() {
  if kubectl get secret plano-secrets -n "$NS" &>/dev/null; then
    echo "  Secret plano-secrets already exists, skipping."
    return
  fi

  local openai_api_key="${OPENAI_API_KEY:-}"
  if [ -z "$openai_api_key" ]; then
    echo ""
    echo "No 'plano-secrets' secret found in namespace '$NS'."
    echo "Enter API keys (input is hidden):"
    echo ""
    read -r -s -p "  OPENAI_API_KEY: " openai_api_key
    echo ""
  else
    echo "  Using OPENAI_API_KEY from environment."
  fi

  if [ -z "$openai_api_key" ]; then
    echo "ERROR: OPENAI_API_KEY cannot be empty."
    exit 1
  fi

  kubectl create secret generic plano-secrets \
    --from-literal=OPENAI_API_KEY="$openai_api_key" \
    -n "$NS"

  echo "  Secret created."
}

deploy() {
  echo "==> Applying namespace..."
  kubectl apply -f "$K8S_DIR/namespace.yaml"

  echo "==> Creating API key secret..."
  create_secret

  echo "==> Applying Redis (StatefulSet + Services)..."
  kubectl apply -f "$K8S_DIR/redis.yaml"

  echo "==> Applying Jaeger..."
  kubectl apply -f "$K8S_DIR/jaeger.yaml"

  echo "==> Applying Plano config (ConfigMap)..."
  kubectl apply -f "$K8S_DIR/plano-config.yaml"

  echo "==> Applying Plano deployment + HPA..."
  kubectl apply -f "$K8S_DIR/plano.yaml"

  echo ""
  echo "==> Waiting for Redis to be ready..."
  kubectl rollout status statefulset/redis -n "$NS" --timeout=120s

  echo "==> Waiting for Plano pods to be ready..."
  kubectl rollout status deployment/plano -n "$NS" --timeout=120s

  echo ""
  echo "Deployment complete!"
  show_status
  echo ""
  echo "Useful commands:"
  echo "  # Tail logs from all Plano pods:"
  echo "  kubectl logs -l app=plano -n $NS -f"
  echo ""
  echo "  # Open Jaeger UI:"
  echo "  kubectl port-forward svc/jaeger 16686:16686 -n $NS &"
  echo "  open http://localhost:16686"
  echo ""
  echo "  # Access Redis CLI:"
  echo "  kubectl exec -it redis-0 -n $NS -- redis-cli"
  echo ""
  echo "  # Run the verification script:"
  echo "  python $DEMO_DIR/verify_affinity.py"
}

destroy() {
  echo "==> Deleting Plano, Jaeger, and Redis resources..."
  kubectl delete -f "$K8S_DIR/plano.yaml" --ignore-not-found
  kubectl delete -f "$K8S_DIR/jaeger.yaml" --ignore-not-found
  kubectl delete -f "$K8S_DIR/redis.yaml" --ignore-not-found
  kubectl delete -f "$K8S_DIR/plano-config.yaml" --ignore-not-found
  kubectl delete secret plano-secrets -n "$NS" --ignore-not-found

  echo ""
  echo "Resources deleted."
  echo "Namespace '$NS' was kept. Remove it manually if desired:"
  echo "  kubectl delete namespace $NS"
}

show_status() {
  echo ""
  echo "=== Pods ==="
  kubectl get pods -n "$NS" -o wide
  echo ""
  echo "=== Services ==="
  kubectl get svc -n "$NS"
  echo ""
  echo "=== HPA ==="
  kubectl get hpa -n "$NS" 2>/dev/null || true
}

check_prereqs

case "${1:-}" in
  --destroy) destroy ;;
  --status)  show_status ;;
  *)         deploy ;;
esac
@ -1,56 +0,0 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: plano-demo
  labels:
    app: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/jaeger:2.3.0
          ports:
            - containerPort: 16686  # UI
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: plano-demo
  labels:
    app: jaeger
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
      targetPort: 16686
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
---
# NodePort for UI access from your laptop.
# Access at: http://localhost:16686 after: kubectl port-forward svc/jaeger 16686:16686 -n plano-demo
@ -1,6 +0,0 @@
apiVersion: v1
kind: Namespace
metadata:
  name: plano-demo
  labels:
    app.kubernetes.io/part-of: plano-session-affinity-demo
@ -1,50 +0,0 @@
---
# ConfigMap wrapping the Plano config file.
# Regenerate after editing config_k8s.yaml:
#   kubectl create configmap plano-config \
#     --from-file=plano_config.yaml=../config_k8s.yaml \
#     -n plano-demo --dry-run=client -o yaml | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: plano-config
  namespace: plano-demo
data:
  plano_config.yaml: |
    version: v0.4.0

    listeners:
      - type: model
        name: model_listener
        port: 12000

    model_providers:
      - model: openai/gpt-4o-mini
        access_key: $OPENAI_API_KEY
        default: true

      - model: openai/gpt-5.2
        access_key: $OPENAI_API_KEY

    routing_preferences:
      - name: fast_responses
        description: short factual questions, quick lookups, simple summarization, or greetings
        models:
          - openai/gpt-4o-mini

      - name: deep_reasoning
        description: multi-step reasoning, complex analysis, code review, or detailed explanations
        models:
          - openai/gpt-5.2
          - openai/gpt-4o-mini

    routing:
      session_ttl_seconds: 600
      session_cache:
        type: redis
        url: redis://redis.plano-demo.svc.cluster.local:6379

    tracing:
      random_sampling: 100
      trace_arch_internal: true
      opentracing_grpc_endpoint: http://jaeger.plano-demo.svc.cluster.local:4317
@ -1,19 +0,0 @@
# EXAMPLE — do NOT apply this file directly.
# Create the real secret with:
#
#   kubectl create secret generic plano-secrets \
#     --from-literal=OPENAI_API_KEY=sk-... \
#     -n plano-demo
#
# Or use the deploy.sh script, which prompts for keys and creates the secret.
#
# If you use a secrets manager (AWS Secrets Manager, GCP Secret Manager, Vault),
# replace this with an ExternalSecret or a CSI driver volume mount instead.
apiVersion: v1
kind: Secret
metadata:
  name: plano-secrets
  namespace: plano-demo
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-replace-me"
@ -1,130 +0,0 @@
---
# Plano Deployment — 2 replicas sharing one Redis instance.
# All replicas are stateless; routing state lives entirely in Redis.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plano
  namespace: plano-demo
  labels:
    app: plano
spec:
  replicas: 2
  selector:
    matchLabels:
      app: plano
  template:
    metadata:
      labels:
        app: plano
    spec:
      containers:
        - name: plano
          # Local dev: run-local.sh sets this to plano-redis:local and loads it
          # into minikube/kind so no registry is needed.
          # Production: replace with your registry image and use imagePullPolicy: Always.
          image: plano-redis:local
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 12000
              name: llm-gateway
          envFrom:
            - secretRef:
                name: plano-secrets
          env:
            - name: LOG_LEVEL
              value: "info"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: plano-config
              mountPath: /app/plano_config.yaml
              subPath: plano_config.yaml
              readOnly: true
          readinessProbe:
            httpGet:
              path: /healthz
              port: 12000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz
              port: 12000
            initialDelaySeconds: 15
            periodSeconds: 30
            failureThreshold: 3
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
      volumes:
        - name: plano-config
          configMap:
            name: plano-config
---
# LoadBalancer Service — exposes Plano externally.
# On minikube, run: minikube tunnel
# On kind, use NodePort instead (see comment below).
# On cloud providers (GKE, EKS, AKS), an external IP is assigned automatically.
apiVersion: v1
kind: Service
metadata:
  name: plano
  namespace: plano-demo
  labels:
    app: plano
spec:
  type: LoadBalancer
  selector:
    app: plano
  ports:
    - name: llm-gateway
      port: 12000
      targetPort: 12000
---
# Uncomment and use instead of the LoadBalancer above when running on kind/minikube
# without tunnel:
#
# apiVersion: v1
# kind: Service
# metadata:
#   name: plano
#   namespace: plano-demo
# spec:
#   type: NodePort
#   selector:
#     app: plano
#   ports:
#     - name: llm-gateway
#       port: 12000
#       targetPort: 12000
#       nodePort: 32000
---
# HorizontalPodAutoscaler — scales 2 to 5 replicas based on CPU.
# Demonstrates that new replicas join the existing session state seamlessly.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: plano
  namespace: plano-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: plano
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
@ -1,96 +0,0 @@
---
# Redis StatefulSet — single-shard, persistence enabled.
# For production, replace with Redis Cluster or a managed service (ElastiCache, MemoryStore, etc.).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: plano-demo
  labels:
    app: redis
spec:
  serviceName: redis
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
              name: redis
          command:
            - redis-server
            - --appendonly
            - "yes"
            - --maxmemory
            - "256mb"
            - --maxmemory-policy
            - allkeys-lru
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 15
            periodSeconds: 20
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "320Mi"
              cpu: "500m"
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
---
# Stable DNS name: redis.plano-demo.svc.cluster.local:6379
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: plano-demo
  labels:
    app: redis
spec:
  selector:
    app: redis
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
  clusterIP: None  # headless — StatefulSet pods get stable DNS
---
# Regular ClusterIP for application code (redis://redis:6379)
apiVersion: v1
kind: Service
metadata:
  name: redis-service
  namespace: plano-demo
  labels:
    app: redis
spec:
  selector:
    app: redis
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
@ -1,154 +0,0 @@
#!/usr/bin/env bash
# run-local.sh — Build and run the k8s session affinity demo entirely locally with kind.
# No registry, no image push required.
#
# Usage:
#   ./run-local.sh                  # create cluster (if needed), build, deploy, verify
#   ./run-local.sh --build-only     # build and load the image into kind
#   ./run-local.sh --deploy-only    # skip build, re-apply k8s manifests
#   ./run-local.sh --verify         # run verify_affinity.py against the running cluster
#   ./run-local.sh --down           # tear down k8s resources (keeps kind cluster)
#   ./run-local.sh --delete-cluster # also delete the kind cluster

set -euo pipefail

DEMO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$DEMO_DIR/../../.." && pwd)"
IMAGE_NAME="plano-redis:local"
KIND_CLUSTER="plano-demo"

# ---------------------------------------------------------------------------
# Prereq check
# ---------------------------------------------------------------------------

check_prereqs() {
  local missing=()
  command -v docker >/dev/null 2>&1 || missing+=("docker")
  command -v kubectl >/dev/null 2>&1 || missing+=("kubectl")
  command -v kind >/dev/null 2>&1 || missing+=("kind (https://kind.sigs.k8s.io/docs/user/quick-start/#installation)")
  command -v python3 >/dev/null 2>&1 || missing+=("python3")

  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: missing required tools:"
    for t in "${missing[@]}"; do echo "  - $t"; done
    exit 1
  fi
}

load_env() {
  if [ -f "$DEMO_DIR/.env" ]; then
    set -a
    # shellcheck disable=SC1091
    source "$DEMO_DIR/.env"
    set +a
  fi
}

# ---------------------------------------------------------------------------
# Cluster lifecycle
# ---------------------------------------------------------------------------

ensure_cluster() {
  if kind get clusters 2>/dev/null | grep -q "^${KIND_CLUSTER}$"; then
    echo "==> kind cluster '$KIND_CLUSTER' already exists, reusing."
  else
    echo "==> Creating kind cluster '$KIND_CLUSTER'..."
    kind create cluster --name "$KIND_CLUSTER"
    echo "    Cluster created."
  fi

  # Point kubectl at this cluster
  kubectl config use-context "kind-${KIND_CLUSTER}" >/dev/null
}

# ---------------------------------------------------------------------------
# Build and load
# ---------------------------------------------------------------------------

build() {
  echo "==> Building image '$IMAGE_NAME' from repo root..."
  docker build \
    --file "$DEMO_DIR/Dockerfile" \
    --tag "$IMAGE_NAME" \
    --progress=plain \
    "$REPO_ROOT"

  echo "==> Loading '$IMAGE_NAME' into kind cluster '$KIND_CLUSTER'..."
  kind load docker-image "$IMAGE_NAME" --name "$KIND_CLUSTER"
  echo "    Image loaded."
}

# ---------------------------------------------------------------------------
# Deploy / verify / teardown
# ---------------------------------------------------------------------------

deploy() {
  echo ""
  echo "==> Deploying to Kubernetes..."
  "$DEMO_DIR/deploy.sh"
}

verify() {
  echo ""
  echo "==> Running cross-replica verification..."
  python3 "$DEMO_DIR/verify_affinity.py"
}

down() {
  "$DEMO_DIR/deploy.sh" --destroy
}

delete_cluster() {
  echo "==> Deleting kind cluster '$KIND_CLUSTER'..."
  kind delete cluster --name "$KIND_CLUSTER"
  echo "    Cluster deleted."
}

# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

case "${1:-}" in
  --build-only)
    check_prereqs
    load_env
    ensure_cluster
    build
    ;;
  --deploy-only)
    check_prereqs
    load_env
    ensure_cluster
    deploy
    ;;
  --verify)
    check_prereqs
    verify
    ;;
  --down)
    check_prereqs
    down
    ;;
  --delete-cluster)
    check_prereqs
    down
    delete_cluster
    ;;
  "")
    check_prereqs
    load_env
    ensure_cluster
    echo ""
    build
    deploy
    echo ""
    echo "==> Everything is up. Running verification in 5 seconds..."
    echo "    (Ctrl-C to skip — run manually with: ./run-local.sh --verify)"
    sleep 5
    verify
    ;;
  *)
    echo "Usage: $0 [--build-only | --deploy-only | --verify | --down | --delete-cluster]"
    exit 1
    ;;
esac
@ -1,418 +0,0 @@
#!/usr/bin/env python3
"""
verify_affinity.py — Prove that Redis-backed session affinity works across Plano replicas.

Strategy
--------
Kubernetes round-robin is non-deterministic, so simply hammering the LoadBalancer
service is not a reliable proof. Instead this script:

1. Discovers the two (or more) Plano pod names with kubectl.
2. Opens a kubectl port-forward tunnel to EACH pod on a separate local port.
3. Pins a session via Pod 0 (writes the Redis key).
4. Reads the same session via Pod 1 (must return the same model — reads Redis).
5. Repeats across N sessions, round-robining which pod sets vs. reads the pin.

If every round returns the same model, Redis is the shared source of truth and
multi-replica affinity is proven.

Usage
-----
# From inside the cluster network (e.g. CI job or jumpbox):
python verify_affinity.py --url http://<LoadBalancer-IP>:12000

# From your laptop (uses kubectl port-forward automatically):
python verify_affinity.py

# More sessions / rounds:
python verify_affinity.py --sessions 5 --rounds 6

Requirements
------------
kubectl — configured to reach the plano-demo namespace
Python 3.11+
"""

import argparse
import http.client
import json
import signal
import subprocess
import sys
import time
import urllib.error
import urllib.request
from contextlib import contextmanager

NAMESPACE = "plano-demo"
BASE_LOCAL_PORT = 19100  # port-forward starts here, increments per pod

PROMPTS = [
    "Explain the difference between TCP and UDP in detail.",
    "Write a merge sort implementation in Python.",
    "What is quantum entanglement?",
    "Describe the CAP theorem with examples.",
    "How does gradient descent work in neural networks?",
    "What is the time complexity of Dijkstra's algorithm?",
]


# ---------------------------------------------------------------------------
# kubectl helpers
# ---------------------------------------------------------------------------


def get_pod_names() -> list[str]:
    """Return running Plano pod names in the plano-demo namespace."""
    result = subprocess.run(
        [
            "kubectl",
            "get",
            "pods",
            "-n",
            NAMESPACE,
            "-l",
            "app=plano",
            "--field-selector=status.phase=Running",
            "-o",
            "jsonpath={.items[*].metadata.name}",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    pods = result.stdout.strip().split()
    if not pods or pods == [""]:
        raise RuntimeError(
            f"No running Plano pods found in namespace '{NAMESPACE}'.\n"
            "Is the cluster deployed? Run: ./deploy.sh"
        )
    return pods


@contextmanager
def port_forward(pod_name: str, local_port: int, remote_port: int = 12000):
    """Context manager that starts and stops a kubectl port-forward."""
    proc = subprocess.Popen(
        [
            "kubectl",
            "port-forward",
            f"pod/{pod_name}",
            f"{local_port}:{remote_port}",
            "-n",
            NAMESPACE,
        ],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Give the tunnel a moment to establish
    time.sleep(1.5)
    try:
        yield f"http://localhost:{local_port}"
    finally:
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=3)
        except subprocess.TimeoutExpired:
            proc.kill()


# ---------------------------------------------------------------------------
# HTTP helpers
# ---------------------------------------------------------------------------


def chat(
    base_url: str,
    session_id: str | None,
    message: str,
    model: str = "openai/gpt-4o-mini",
    retries: int = 3,
    retry_delay: float = 5.0,
) -> dict:
    payload = json.dumps(
        {
            "model": model,
            "messages": [{"role": "user", "content": message}],
        }
    ).encode()

    headers = {"Content-Type": "application/json"}
    if session_id:
        headers["x-model-affinity"] = session_id

    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers=headers,
        method="POST",
    )
    last_err: Exception | None = None
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req, timeout=60) as resp:
                body = resp.read()
                if not body:
                    raise RuntimeError(f"Empty response body from {base_url}")
                return json.loads(body)
        except urllib.error.HTTPError as e:
            if e.code in (503, 502, 429) and attempt < retries - 1:
                time.sleep(retry_delay * (attempt + 1))
                last_err = e
                continue
            raise RuntimeError(f"Request to {base_url} failed: {e}") from e
        except (
            urllib.error.URLError,
            http.client.RemoteDisconnected,
            RuntimeError,
        ) as e:
            if attempt < retries - 1:
                time.sleep(retry_delay * (attempt + 1))
                last_err = e
                continue
            raise RuntimeError(f"Request to {base_url} failed: {e}") from e
        except json.JSONDecodeError as e:
            raise RuntimeError(f"Invalid JSON from {base_url}: {e}") from e
    raise RuntimeError(
        f"Request to {base_url} failed after {retries} attempts: {last_err}"
    )


def extract_model(response: dict) -> str:
    return response.get("model", "<unknown>")


# ---------------------------------------------------------------------------
# Verification phases
# ---------------------------------------------------------------------------


def phase_loadbalancer(url: str, rounds: int) -> None:
    """Phase 0: quick smoke test against the LoadBalancer / provided URL."""
    print("=" * 66)
    print(f"Phase 0: Smoke test against {url}")
    print("=" * 66)
    for i in range(rounds):
        resp = chat(url, None, PROMPTS[i % len(PROMPTS)])
        print(f"  Request {i + 1}: model = {extract_model(resp)}")
    print()


def phase_cross_replica(
    pod_urls: dict[str, str], num_sessions: int, rounds: int
) -> bool:
    """
    Phase 1 — Cross-replica pinning.

    For each session:
      • Round 1: send to pod_A (sets the Redis key)
      • Rounds 2+: alternate between pod_A and pod_B
      • Assert every round returns the same model.
    """
    pod_names = list(pod_urls.keys())
    all_passed = True
    session_results: dict[str, dict] = {}

    print("=" * 66)
    print("Phase 1: Cross-replica session pinning")
    print(f"  Pods under test : {', '.join(pod_names)}")
    print(f"  Sessions        : {num_sessions}")
    print(f"  Rounds/session  : {rounds}")
    print()
    print("  Each session is PINNED via one pod and VERIFIED via another.")
    print("  If Redis is shared, every round must return the same model.")
    print("=" * 66)

    for s in range(num_sessions):
        session_id = f"k8s-session-{s + 1:03d}"
        models_seen = []
        pod_sequence = []

        for r in range(rounds):
            # Alternate which pod handles each round
            pod_name = pod_names[r % len(pod_names)]
            url = pod_urls[pod_name]

            try:
                resp = chat(url, session_id, PROMPTS[(s + r) % len(PROMPTS)])
                model = extract_model(resp)
            except RuntimeError as e:
                print(f"  ERROR on {pod_name} round {r + 1}: {e}")
                all_passed = False
                continue

            models_seen.append(model)
            pod_sequence.append(pod_name)

        unique_models = set(models_seen)
        passed = len(unique_models) == 1

        session_results[session_id] = {
            "passed": passed,
            "model": models_seen[0] if models_seen else "<none>",
            "unique_models": unique_models,
            "pod_sequence": pod_sequence,
        }

        status = "PASS" if passed else "FAIL"
        detail = models_seen[0] if passed else str(unique_models)
        print(f"\n  {status} {session_id}")
        print(f"     model     : {detail}")
        print(f"     pod order : {' → '.join(pod_sequence)}")

        if not passed:
            all_passed = False

    return all_passed


def phase_redis_inspect(num_sessions: int) -> None:
    """Phase 2: read keys directly from Redis to show what's stored."""
    print()
    print("=" * 66)
    print("Phase 2: Redis key inspection")
    print("=" * 66)
    for s in range(num_sessions):
        session_id = f"k8s-session-{s + 1:03d}"
        result = subprocess.run(
            [
                "kubectl",
                "exec",
                "-n",
                NAMESPACE,
                "redis-0",
                "--",
                "redis-cli",
                "GET",
                session_id,
            ],
            capture_output=True,
            text=True,
        )
        raw = result.stdout.strip()
        ttl_result = subprocess.run(
            [
                "kubectl",
                "exec",
                "-n",
                NAMESPACE,
                "redis-0",
                "--",
                "redis-cli",
                "TTL",
                session_id,
            ],
            capture_output=True,
            text=True,
        )
        ttl = ttl_result.stdout.strip()

        if raw and raw != "(nil)":
            try:
                data = json.loads(raw)
                print(f"  {session_id}")
                print(f"    model_name : {data.get('model_name', '?')}")
                print(f"    route_name : {data.get('route_name', 'null')}")
                print(f"    TTL        : {ttl}s remaining")
            except json.JSONDecodeError:
                print(f"  {session_id}: (raw) {raw}")
        else:
            print(f"  {session_id}: key not found or expired")


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main() -> None:
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        "--url",
        default=None,
        help="LoadBalancer URL to use instead of per-pod port-forwards. "
        "When set, cross-replica proof is skipped (no pod targeting).",
    )
    parser.add_argument(
        "--sessions", type=int, default=4, help="Number of sessions (default 4)"
    )
    parser.add_argument(
        "--rounds", type=int, default=4, help="Rounds per session (default 4)"
    )
    parser.add_argument(
        "--skip-redis-inspect", action="store_true", help="Skip Redis key inspection"
    )
    args = parser.parse_args()
||||
if args.url:
|
||||
# Simple mode: hit the LoadBalancer directly
|
||||
print(f"Mode: LoadBalancer ({args.url})")
|
||||
print()
|
||||
phase_loadbalancer(args.url, args.rounds)
|
||||
print("To get the full cross-replica proof, run without --url.")
|
||||
sys.exit(0)
|
||||
|
||||
# Full mode: port-forward to each pod individually
|
||||
print("Mode: per-pod port-forward (full cross-replica proof)")
|
||||
print()
|
||||
|
||||
try:
|
||||
pod_names = get_pod_names()
|
||||
except (subprocess.CalledProcessError, RuntimeError) as e:
|
||||
print(f"ERROR: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
if len(pod_names) < 2:
|
||||
print(f"WARNING: only {len(pod_names)} Plano pod(s) running.")
|
||||
print(" For a true cross-replica test you need at least 2.")
|
||||
print(" Scale up: kubectl scale deployment/plano --replicas=2 -n plano-demo")
|
||||
print()
|
||||
|
||||
print(f"Found {len(pod_names)} Plano pod(s): {', '.join(pod_names)}")
|
||||
print("Opening per-pod port-forward tunnels...")
|
||||
print()
|
||||
|
||||
pod_urls: dict[str, str] = {}
|
||||
contexts = []
|
||||
|
||||
for i, pod in enumerate(pod_names):
|
||||
local_port = BASE_LOCAL_PORT + i
|
||||
ctx = port_forward(pod, local_port)
|
||||
url = ctx.__enter__()
|
||||
pod_urls[pod] = url
|
||||
contexts.append((ctx, url))
|
||||
print(f" {pod} → localhost:{local_port}")
|
||||
|
||||
print()
|
||||
|
||||
try:
|
||||
passed = phase_cross_replica(pod_urls, args.sessions, args.rounds)
|
||||
|
||||
if not args.skip_redis_inspect:
|
||||
phase_redis_inspect(args.sessions)
|
||||
|
||||
print()
|
||||
print("=" * 66)
|
||||
print("Summary")
|
||||
print("=" * 66)
|
||||
if passed:
|
||||
print("All sessions were pinned consistently across replicas.")
|
||||
print("Redis session cache is working correctly in Kubernetes.")
|
||||
else:
|
||||
print("One or more sessions were NOT consistent across replicas.")
|
||||
print("Check brightstaff logs: kubectl logs -l app=plano -n plano-demo")
|
||||
|
||||
finally:
|
||||
for ctx, _ in contexts:
|
||||
try:
|
||||
ctx.__exit__(None, None, None)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
sys.exit(0 if passed else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||