plano/demos/llm_routing/session_affinity_redis_k8s/README.md

10 KiB
Raw Blame History

Session Affinity — Multi-Replica Kubernetes Deployment

Production-style Kubernetes demo that proves Redis-backed session affinity (X-Model-Affinity) works correctly when Plano runs as multiple replicas behind a load balancer.

Architecture

                     ┌─────────────────────────────────────────┐
                     │           Kubernetes Cluster             │
                     │                                          │
   Client ──────────►│  LoadBalancer Service (port 12000)       │
                     │       │               │                  │
                     │  ┌────▼────┐    ┌─────▼───┐             │
                     │  │ Plano   │    │ Plano   │  (replicas) │
                     │  │ Pod 0   │    │ Pod 1   │             │
                     │  └────┬────┘    └────┬────┘             │
                     │       └──────┬───────┘                  │
                     │         ┌────▼────┐                     │
                     │         │  Redis  │ (StatefulSet)        │
                     │         │ Pod     │ shared session store │
                     │         └─────────┘                     │
                     │                                          │
                     │  ┌──────────┐                           │
                     │  │  Jaeger  │  distributed tracing       │
                     │  └──────────┘                           │
                     └─────────────────────────────────────────┘

What makes this production-like:

Feature Detail
2 Plano replicas replicas: 2 with HPA (scales 25 on CPU)
Shared Redis StatefulSet with PVC — sessions survive pod restarts
Session TTL 600 s, enforced natively by Redis EX
Eviction policy allkeys-lru — Redis auto-evicts oldest sessions under memory pressure
Distributed tracing Jaeger collects spans from both pods
Health probes Readiness + liveness gates traffic away from unhealthy pods

Quick Start (local — no registry needed)

# 1. Install kind if needed
#    https://kind.sigs.k8s.io/docs/user/quick-start/#installation
#    brew install kind   (macOS)

# 2. Set your API key
export OPENAI_API_KEY=sk-...
# or copy and edit:
cp .env.example .env

# 3. Build, deploy, and verify in one command
./run-local.sh

run-local.sh creates a kind cluster named plano-demo (if it doesn't exist), builds the image locally, loads it into the cluster with kind load docker-imageno registry, no push required.

Individual steps:

./run-local.sh --build-only      # (re-)build and reload image into kind
./run-local.sh --deploy-only     # (re-)apply k8s manifests
./run-local.sh --verify          # run verify_affinity.py
./run-local.sh --down            # delete k8s resources (keeps kind cluster)
./run-local.sh --delete-cluster  # delete k8s resources + kind cluster

Prerequisites

Tool Notes
kubectl Configured to reach a Kubernetes cluster
docker To build and push the custom image
Container registry (optional) Needed only when you are not using the local kind flow
OPENAI_API_KEY For model inference
Python 3.11+ Only for verify_affinity.py

Cluster: run-local.sh creates and manages a kind cluster named plano-demo automatically. Install kind from https://kind.sigs.k8s.io or brew install kind.

Step 1 — Build the Image

Build a custom image from the repo root:

# From this demo directory:
./build-and-push.sh ghcr.io/yourorg/plano-redis:latest

# Or manually from the repo root:
docker build \
  -f demos/llm_routing/session_affinity_redis_k8s/Dockerfile \
  -t ghcr.io/yourorg/plano-redis:latest \
  .
docker push ghcr.io/yourorg/plano-redis:latest

Then update the image reference in k8s/plano.yaml (skip this when using run-local.sh, which uses plano-redis:local automatically):

image: ghcr.io/yourorg/plano-redis:latest  # ← replace YOUR_REGISTRY/plano-redis:latest

Step 2 — Deploy

./deploy.sh

The script:

  1. Creates the plano-demo namespace
  2. Prompts for OPENAI_API_KEY and creates a Kubernetes Secret
  3. Applies Redis, Jaeger, ConfigMap, and Plano manifests in order
  4. Waits for rollouts to complete

Expected output:

==> Applying namespace...
==> Creating API key secret...
  OPENAI_API_KEY: [hidden]
==> Applying Redis (StatefulSet + Services)...
==> Applying Jaeger...
==> Applying Plano config (ConfigMap)...
==> Applying Plano deployment + HPA...
==> Waiting for Redis to be ready...
==> Waiting for Plano pods to be ready...

Deployment complete!

=== Pods ===
NAME                     READY   STATUS    NODE
redis-0                  1/1     Running   node-1
plano-6d8f9b-xk2pq       1/1     Running   node-1
plano-6d8f9b-r7nlw       1/1     Running   node-2
jaeger-5c7d8f-q9mnb      1/1     Running   node-1

=== Services ===
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)
plano     LoadBalancer   10.96.12.50    203.0.113.42   12000:32000/TCP
redis     ClusterIP      None           <none>          6379/TCP
jaeger    ClusterIP      10.96.8.71     <none>          16686/TCP,...

Step 3 — Verify Session Affinity Across Replicas

python verify_affinity.py

The script opens a dedicated kubectl port-forward tunnel to each pod individually. This is the definitive test: it routes requests to specific pods rather than relying on random load-balancer assignment.

Mode: per-pod port-forward (full cross-replica proof)

Found 2 Plano pod(s): plano-6d8f9b-xk2pq, plano-6d8f9b-r7nlw
Opening per-pod port-forward tunnels...

  plano-6d8f9b-xk2pq → localhost:19100
  plano-6d8f9b-r7nlw → localhost:19101

==================================================================
Phase 1: Cross-replica session pinning
  Pods under test : plano-6d8f9b-xk2pq, plano-6d8f9b-r7nlw
  Sessions        : 4
  Rounds/session  : 4

  Each session is PINNED via one pod and VERIFIED via another.
  If Redis is shared, every round must return the same model.
==================================================================

  PASS  k8s-session-001
        model      : gpt-4o-mini-2024-07-18
        pod order  : plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw

  PASS  k8s-session-002
        model      : gpt-5.2
        pod order  : plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq

  PASS  k8s-session-003
        model      : gpt-4o-mini-2024-07-18
        pod order  : plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw

  PASS  k8s-session-004
        model      : gpt-5.2
        pod order  : plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq → plano-6d8f9b-r7nlw → plano-6d8f9b-xk2pq

==================================================================
Phase 2: Redis key inspection
==================================================================
  k8s-session-001
    model_name : gpt-4o-mini-2024-07-18
    route_name : fast_responses
    TTL        : 587s remaining
  k8s-session-002
    model_name : gpt-5.2
    route_name : deep_reasoning
    TTL        : 581s remaining
  ...

==================================================================
Summary
==================================================================
All sessions were pinned consistently across replicas.
Redis session cache is working correctly in Kubernetes.

What to Look For

The cross-replica proof

Each session's pod order line shows it alternating between the two pods:

pod order: pod-A → pod-B → pod-A → pod-B

Round 1 sets the Redis key (via pod-A). Rounds 2, 3, 4 read from Redis on alternating pods. If the model stays the same across all rounds, Redis is the shared source of truth — not any in-process state.

Redis keys

kubectl exec -it redis-0 -n plano-demo -- redis-cli

127.0.0.1:6379> KEYS *
1) "k8s-session-001"
2) "k8s-session-002"

127.0.0.1:6379> GET k8s-session-001
{"model_name":"gpt-4o-mini-2024-07-18","route_name":"fast_responses"}

127.0.0.1:6379> TTL k8s-session-001
(integer) 587

Jaeger traces

kubectl port-forward svc/jaeger 16686:16686 -n plano-demo

Open http://localhost:16686, select service plano.

  • Pinned requests — no span to the Arch-Router (decision served from Redis)
  • First request per session — spans include the router call + a Redis SET
  • Both Plano pods appear as separate instances in the trace list

Scaling up (HPA in action)

# Scale to 3 replicas manually
kubectl scale deployment/plano --replicas=3 -n plano-demo

# Run verification again — now 3 pods alternate
python verify_affinity.py --sessions 6

Existing sessions in Redis are unaffected by the scale event. New pods immediately participate in the shared session pool.

Teardown

./deploy.sh --destroy
# Then optionally:
kubectl delete namespace plano-demo

Notes

  • The Redis StatefulSet uses a PersistentVolumeClaim. Session data survives pod restarts within a TTL window but is not HA. For production, replace with Redis Sentinel, Redis Cluster, or a managed service (ElastiCache, MemoryStore).
  • session_max_entries is not enforced by this backend — Redis uses maxmemory-policy: allkeys-lru instead, which is a global limit across all keys rather than a per-application cap.
  • On minikube, run minikube tunnel in a separate terminal to get an external IP for the LoadBalancer service.
  • On kind, switch to NodePort (see the comment in k8s/plano.yaml).