Mirror of https://github.com/katanemo/plano.git (synced 2026-05-18 13:45:15 +02:00)

add Redis session affinity demos (Docker Compose and Kubernetes)

Commit 90810078da (parent 50670f843d)
20 changed files with 2080 additions and 0 deletions

demos/llm_routing/session_affinity_redis/.env.example (Normal file, 1 line added)
@@ -0,0 +1 @@
OPENAI_API_KEY=sk-replace-me

demos/llm_routing/session_affinity_redis/README.md (Normal file, 247 lines added)
@@ -0,0 +1,247 @@
# Session Affinity with Redis — Multi-Replica Model Pinning

This demo shows Plano's **session affinity** (`X-Model-Affinity` header) backed by a **Redis session cache** instead of the default in-memory store.

## The Problem

By default, model affinity stores routing decisions in a per-process `HashMap`.
This works for single-instance deployments, but breaks when you run multiple
Plano replicas behind a load balancer:

```
Client ──► Load Balancer ──► Replica A (session pinned here)
                        └──► Replica B (knows nothing about the session)
```

A request that was pinned to `gpt-4o` on Replica A will be re-routed from
scratch on Replica B, defeating the purpose of affinity.
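
To make the failure mode concrete, here is a minimal Python sketch (not Plano code). The `Replica` class, the `models_seen` helper, and the fixed per-replica router answers are all hypothetical. The two replicas are deliberately given different router answers so the divergence is visible, and a shared `dict` stands in for a shared session store.

```python
"""Sketch: why per-process affinity breaks behind a load balancer."""
from itertools import cycle


class Replica:
    """Hypothetical stand-in for one Plano process."""

    def __init__(self, name, store, fresh_choices):
        self.name = name
        self.store = store                  # session id -> pinned model
        self._fresh = cycle(fresh_choices)  # stand-in for the router's answer

    def handle(self, session_id):
        # On a miss, "route" and remember the decision in this replica's store.
        if session_id not in self.store:
            self.store[session_id] = next(self._fresh)
        return self.store[session_id]


def models_seen(replicas, session_id, requests=4):
    """Round-robin `requests` calls across replicas; return the distinct models."""
    load_balancer = cycle(replicas)
    return {next(load_balancer).handle(session_id) for _ in range(requests)}


# Each replica owns its own dict: a pin made on A is invisible to B.
isolated = [
    Replica("A", {}, ["openai/gpt-5.2"]),
    Replica("B", {}, ["openai/gpt-4o-mini"]),
]

# Both replicas use one dict, standing in for a store shared by all replicas.
shared_store = {}
shared = [
    Replica("A", shared_store, ["openai/gpt-5.2"]),
    Replica("B", shared_store, ["openai/gpt-4o-mini"]),
]

if __name__ == "__main__":
    print(sorted(models_seen(isolated, "my-session-abc")))
    print(sorted(models_seen(shared, "my-session-abc")))
```

With isolated stores the same session ends up on two models; with the shared store it stays on whichever model was chosen first.
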

## The Solution

Plano's `session_cache` config key accepts a `type: redis` backend that is
shared across all replicas:

```yaml
routing:
  session_ttl_seconds: 300
  session_cache:
    type: redis
    url: redis://localhost:6379
```

All replicas read and write the same Redis keyspace. A session pinned on any
replica is immediately visible to all others.

## What to Look For

| What | Expected behaviour |
|------|--------------------|
| First request with a session ID | Plano routes normally (via Arch-Router) and writes the result to Redis (`SET session-id ... EX 300`) |
| Subsequent requests with the **same** session ID | Plano reads from Redis and skips the router — same model every time |
| Requests with a **different** session ID | Routed independently; may land on a different model |
| After `session_ttl_seconds` elapses | Redis key expires; next request re-routes and sets a new pin |
| `x-plano-pinned: true` response header | Tells you the response was served from the session cache |

## Architecture

```
Client
  │  X-Model-Affinity: my-session-id
  ▼
Plano (brightstaff)
  ├── GET redis://localhost:6379/my-session-id
  │     hit?  → return pinned model immediately (no Arch-Router call)
  │     miss? → call Arch-Router → SET key EX 300 → return routed model
  ▼
Redis (shared across replicas)
```
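
The hit/miss flow in the diagram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Plano's implementation: `TtlStore` is a hypothetical in-memory stand-in for Redis `GET` and `SET ... EX`, `resolve_model` and its `route` callback are made-up names, and the clock is injectable only so that expiry can be exercised without waiting.

```python
"""Sketch of the hit/miss flow with a TTL, using an in-memory stand-in for Redis."""
import time


class TtlStore:
    """Hypothetical stand-in for Redis GET / SET ... EX."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a missing key
            return None
        return value

    def set(self, key, value, ex):
        self._data[key] = (value, self._clock() + ex)


def resolve_model(store, session_id, route, ttl_seconds=300):
    """Return (model, pinned). `route` is called only on a cache miss."""
    cached = store.get(session_id)
    if cached is not None:
        return cached, True           # hit: skip the router entirely
    model = route()                   # miss: ask the router
    store.set(session_id, model, ex=ttl_seconds)
    return model, False
```

Passing a fake clock (for example `TtlStore(clock=lambda: now[0])`) lets you advance time past `ttl_seconds` and observe that the next request routes again and sets a new pin.
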

## Prerequisites

| Requirement | Notes |
|-------------|-------|
| `planoai` CLI | `pip install planoai` |
| Docker + Docker Compose | For Redis and Jaeger |
| `OPENAI_API_KEY` | Required for routing model (Arch-Router) and downstream LLMs |
| Python 3.11+ | Only needed to run `verify_affinity.py` |

## Quick Start

```bash
# 1. Set your API key
export OPENAI_API_KEY=sk-...
# or copy and edit:
cp .env.example .env

# 2. Start Redis, Jaeger, and Plano
./run_demo.sh up

# 3. Verify session pinning works
python verify_affinity.py
```

## Manual Verification with curl

### Step 1 — Pin a session (first request sets the affinity)

```bash
curl -s http://localhost:12000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-model-affinity: my-session-abc" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Write a short poem about the ocean."}]}' \
  | jq '{model, pinned: .x_plano_pinned}'
```
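
The `jq` filter above reads `pinned` from the JSON body, whereas `x-plano-pinned` is described under "What to Look For" as a response header. To inspect the header itself, rerun the same `curl` with `-i` and without the `jq` pipe.

Expected output (first request — not yet pinned, Arch-Router picks the model):

```json
{
  "model": "openai/gpt-5.2",
  "pinned": null
}
```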
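
The same request can be issued from Python with only the standard library, which also makes it easy to read response headers. The sketch below is illustrative: `build_request`, `is_pinned`, and `send` are hypothetical helper names, and treating the header value `true` as the pin marker is taken from the table earlier in this README rather than from Plano's source.

```python
"""Sketch: build an affinity request and read the pin marker from response headers."""
import json
import urllib.request

PLANO_URL = "http://localhost:12000/v1/chat/completions"


def build_request(session_id, prompt, url=PLANO_URL):
    """Create a POST request carrying the x-model-affinity header."""
    payload = json.dumps(
        {
            "model": "openai/gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        }
    ).encode()
    headers = {
        "Content-Type": "application/json",
        "x-model-affinity": session_id,
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def is_pinned(headers):
    """True when the headers carry `x-plano-pinned: true` (case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "x-plano-pinned":
            return value.strip().lower() == "true"
    return False


def send(session_id, prompt):
    """Send one request (requires the demo to be running); return (model, pinned)."""
    with urllib.request.urlopen(build_request(session_id, prompt), timeout=30) as resp:
        body = json.loads(resp.read())
        return body.get("model"), is_pinned(resp.headers)
```

With the demo running, calling `send("my-session-abc", "Write a short poem about the ocean.")` twice should return the same model both times.
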

### Step 2 — Confirm the pin is held on subsequent requests

```bash
for i in 1 2 3 4; do
  curl -s http://localhost:12000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "x-model-affinity: my-session-abc" \
    -d "{\"model\":\"openai/gpt-4o-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Request $i\"}]}" \
    | jq -r '"\(.model)"'
done
```

Expected output (same model for every request):

```
openai/gpt-5.2
openai/gpt-5.2
openai/gpt-5.2
openai/gpt-5.2
```

### Step 3 — Inspect the Redis key directly

```bash
docker exec plano-session-redis redis-cli \
  GET my-session-abc | python3 -m json.tool
```

Expected output:

```json
{
  "model_name": "openai/gpt-5.2",
  "route_name": "deep_reasoning"
}
```

```bash
# Check the TTL (seconds remaining)
docker exec plano-session-redis redis-cli TTL my-session-abc
# e.g. 287
```
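
If you want to consume these values programmatically, the following sketch decodes the cached entry and interprets the `TTL` reply. `PinnedRoute`, `parse_pin`, and `describe_ttl` are hypothetical names; the two JSON fields are the ones shown in the expected output above, and the `-2` / `-1` cases follow the documented return values of the Redis `TTL` command.

```python
"""Sketch: decode the cached routing decision and interpret a TTL reply."""
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedRoute:
    model_name: str
    route_name: str


def parse_pin(raw):
    """Decode a raw Redis value; return None when the key is missing."""
    if raw is None:
        return None
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    return PinnedRoute(model_name=data["model_name"], route_name=data["route_name"])


def describe_ttl(ttl):
    """Interpret the integer returned by the Redis TTL command."""
    if ttl == -2:
        return "key does not exist"
    if ttl == -1:
        return "key has no expiry"
    return f"{ttl}s remaining"
```

For example, `parse_pin(b'{"model_name": "openai/gpt-5.2", "route_name": "deep_reasoning"}')` yields a `PinnedRoute`, and a missing key (an expired session) yields `None`.
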

### Step 4 — Different sessions may get different models

```bash
for session in session-A session-B session-C; do
  model=$(curl -s http://localhost:12000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "x-model-affinity: $session" \
    -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Explain quantum entanglement in detail with equations."}]}' \
    | jq -r '.model')
  echo "$session -> $model"
done
```

Sessions with content matched to `deep_reasoning` will pin to `openai/gpt-5.2`;
sessions matched to `fast_responses` will pin to `openai/gpt-4o-mini`.
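
The route-to-model relationship described here can be written down as a small lookup. The sketch below copies the two routes from this demo's `config.yaml`; the rule that the first listed model wins, the fallback to the `default: true` provider for unknown routes, and the helper names are assumptions made for illustration, not confirmed Plano behaviour.

```python
"""Sketch: which model a route name would pin to, mirroring this demo's config.yaml."""

# Copied from the demo's routing_preferences, in the order listed there.
ROUTING_PREFERENCES = {
    "fast_responses": ["openai/gpt-4o-mini"],
    "deep_reasoning": ["openai/gpt-5.2", "openai/gpt-4o-mini"],
}
DEFAULT_MODEL = "openai/gpt-4o-mini"  # the provider marked `default: true`


def model_for_route(route_name):
    """Assumption: the first model listed for a route is the preferred one."""
    models = ROUTING_PREFERENCES.get(route_name)
    return models[0] if models else DEFAULT_MODEL


def summarize(pins):
    """Group session ids by the model their route would pin to."""
    by_model = {}
    for session_id, route_name in pins.items():
        by_model.setdefault(model_for_route(route_name), []).append(session_id)
    return by_model
```

Feeding `summarize` a mapping such as `{"session-A": "deep_reasoning", "session-B": "fast_responses"}` shows at a glance which sessions would share a model.
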

## Verification Script Output

Running `python verify_affinity.py` produces output like:

```
Plano endpoint : http://localhost:12000/v1/chat/completions
Sessions : 3
Rounds/session : 4

============================================================
Phase 1: Requests WITHOUT X-Model-Affinity header
(model may vary between requests — that is expected)
============================================================
Request 1: model = openai/gpt-4o-mini
Request 2: model = openai/gpt-5.2
Request 3: model = openai/gpt-4o-mini
Models seen across 3 requests: {'openai/gpt-4o-mini', 'openai/gpt-5.2'}

============================================================
Phase 2: Requests WITH X-Model-Affinity (session pinning)
Each session should be pinned to exactly one model.
============================================================

Session 'demo-session-001':
Round 1: model = openai/gpt-4o-mini [FIRST — sets affinity]
Round 2: model = openai/gpt-4o-mini [PINNED]
Round 3: model = openai/gpt-4o-mini [PINNED]
Round 4: model = openai/gpt-4o-mini [PINNED]

Session 'demo-session-002':
Round 1: model = openai/gpt-5.2 [FIRST — sets affinity]
Round 2: model = openai/gpt-5.2 [PINNED]
Round 3: model = openai/gpt-5.2 [PINNED]
Round 4: model = openai/gpt-5.2 [PINNED]

Session 'demo-session-003':
Round 1: model = openai/gpt-4o-mini [FIRST — sets affinity]
Round 2: model = openai/gpt-4o-mini [PINNED]
Round 3: model = openai/gpt-4o-mini [PINNED]
Round 4: model = openai/gpt-4o-mini [PINNED]

============================================================
Results
============================================================
PASS demo-session-001 -> always routed to 'openai/gpt-4o-mini'
PASS demo-session-002 -> always routed to 'openai/gpt-5.2'
PASS demo-session-003 -> always routed to 'openai/gpt-4o-mini'

All sessions were pinned consistently.
Redis session cache is working correctly.
```

## Observability

Open Jaeger at **http://localhost:16686** and select service `plano`.

- Requests **without** affinity: look for a span to the Arch-Router service
- Requests **with** affinity (pinned): the Arch-Router span will be absent —
  the decision was served from Redis without calling the router at all

This is the clearest observable signal that the cache is working: pinned
requests are noticeably faster and produce fewer spans.

## Switching to the In-Memory Backend

To compare against the default in-memory backend, change `config.yaml`:

```yaml
routing:
  session_ttl_seconds: 300
  session_cache:
    type: memory # ← change this
```

In-memory mode does **not** require Redis and works identically for a
single Plano process. The difference only becomes visible when you run
multiple replicas.

## Teardown

```bash
./run_demo.sh down
```

This stops Plano, Redis, and Jaeger.

demos/llm_routing/session_affinity_redis/config.yaml (Normal file, 36 lines added)
@@ -0,0 +1,36 @@
version: v0.4.0

listeners:
  - type: model
    name: model_listener
    port: 12000

model_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true

  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY

routing_preferences:
  - name: fast_responses
    description: short factual questions, quick lookups, simple summarization, or greetings
    models:
      - openai/gpt-4o-mini

  - name: deep_reasoning
    description: multi-step reasoning, complex analysis, code review, or detailed explanations
    models:
      - openai/gpt-5.2
      - openai/gpt-4o-mini

routing:
  session_ttl_seconds: 300
  session_cache:
    type: redis
    url: redis://localhost:6379

tracing:
  random_sampling: 100
  trace_arch_internal: true

demos/llm_routing/session_affinity_redis/docker-compose.yaml (Normal file, 23 lines added)
@@ -0,0 +1,23 @@
services:
  redis:
    image: redis:7-alpine
    container_name: plano-session-redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    command: redis-server --save "" --appendonly no
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 1s
      timeout: 1s
      retries: 10

  jaeger:
    build:
      context: ../../shared/jaeger
    container_name: plano-session-jaeger
    restart: unless-stopped
    ports:
      - "16686:16686"
      - "4317:4317"
      - "4318:4318"

demos/llm_routing/session_affinity_redis/run_demo.sh (Executable file, 94 lines added)
@@ -0,0 +1,94 @@
#!/usr/bin/env bash
set -euo pipefail

DEMO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

load_env() {
  if [ -f "$DEMO_DIR/.env" ]; then
    set -a
    # shellcheck disable=SC1091
    source "$DEMO_DIR/.env"
    set +a
  fi
}

check_prereqs() {
  local missing=()
  command -v docker >/dev/null 2>&1 || missing+=("docker")
  command -v planoai >/dev/null 2>&1 || missing+=("planoai (pip install planoai)")
  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: missing required tools: ${missing[*]}"
    exit 1
  fi

  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "ERROR: OPENAI_API_KEY is not set."
    echo " Create a .env file or export the variable before running."
    exit 1
  fi
}

start_demo() {
  echo "==> Starting Redis + Jaeger..."
  docker compose -f "$DEMO_DIR/docker-compose.yaml" up -d

  echo "==> Waiting for Redis to be ready..."
  local retries=0
  until docker exec plano-session-redis redis-cli ping 2>/dev/null | grep -q PONG; do
    retries=$((retries + 1))
    if [ $retries -ge 15 ]; then
      echo "ERROR: Redis did not become ready in time"
      exit 1
    fi
    sleep 1
  done
  echo " Redis is ready."

  echo "==> Starting Plano..."
  planoai up "$DEMO_DIR/config.yaml"

  echo ""
  echo "Demo is running!"
  echo ""
  echo " Model endpoint: http://localhost:12000/v1/chat/completions"
  echo " Jaeger UI: http://localhost:16686"
  echo " Redis: localhost:6379"
  echo ""
  echo "Run the verification script to confirm session pinning:"
  echo " python $DEMO_DIR/verify_affinity.py"
  echo ""
  echo "Stop the demo with: $0 down"
}

stop_demo() {
  echo "==> Stopping Plano..."
  planoai down 2>/dev/null || true

  echo "==> Stopping Docker services..."
  docker compose -f "$DEMO_DIR/docker-compose.yaml" down

  echo "Demo stopped."
}

usage() {
  echo "Usage: $0 [up|down]"
  echo ""
  echo " up Start Redis, Jaeger, and Plano (default)"
  echo " down Stop all services"
}

load_env

case "${1:-up}" in
  up)
    check_prereqs
    start_demo
    ;;
  down)
    stop_demo
    ;;
  *)
    usage
    exit 1
    ;;
esac

demos/llm_routing/session_affinity_redis/verify_affinity.py (Normal file, 146 lines added)
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
"""
verify_affinity.py — Verify that model affinity (session pinning) works correctly.

Sends multiple requests with the same X-Model-Affinity session ID and asserts
that every response is served by the same model, demonstrating that Plano's
session cache is working as expected.

Usage:
    python verify_affinity.py [--url URL] [--rounds N] [--sessions N]
"""

import argparse
import json
import sys
import urllib.error
import urllib.request
from collections import defaultdict

PLANO_URL = "http://localhost:12000/v1/chat/completions"

PROMPTS = [
    "What is 2 + 2?",
    "Name the capital of France.",
    "How many days in a week?",
    "What color is the sky?",
    "Who wrote Romeo and Juliet?",
]

MESSAGES_PER_SESSION = [{"role": "user", "content": prompt} for prompt in PROMPTS]


def chat(url: str, session_id: str | None, message: str) -> dict:
    payload = json.dumps(
        {
            "model": "openai/gpt-4o-mini",
            "messages": [{"role": "user", "content": message}],
        }
    ).encode()

    headers = {"Content-Type": "application/json"}
    if session_id:
        headers["x-model-affinity"] = session_id

    req = urllib.request.Request(url, data=payload, headers=headers, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())
    except urllib.error.URLError as e:
        print(f" ERROR: could not reach Plano at {url}: {e}", file=sys.stderr)
        print(" Is the demo running? Start it with: ./run_demo.sh up", file=sys.stderr)
        sys.exit(1)


def extract_model(response: dict) -> str:
    return response.get("model", "<unknown>")


def run_verification(url: str, rounds: int, num_sessions: int) -> bool:
    print(f"Plano endpoint : {url}")
    print(f"Sessions : {num_sessions}")
    print(f"Rounds/session : {rounds}")
    print()

    all_passed = True

    # --- Phase 1: Requests without session ID ---
    print("=" * 60)
    print("Phase 1: Requests WITHOUT X-Model-Affinity header")
    print(" (model may vary between requests — that is expected)")
    print("=" * 60)
    models_seen: set[str] = set()
    for i in range(min(rounds, 3)):
        resp = chat(url, None, PROMPTS[i % len(PROMPTS)])
        model = extract_model(resp)
        models_seen.add(model)
        print(f" Request {i + 1}: model = {model}")
    print(f" Models seen across {min(rounds, 3)} requests: {models_seen}")
    print()

    # --- Phase 2: Each session should always get the same model ---
    print("=" * 60)
    print("Phase 2: Requests WITH X-Model-Affinity (session pinning)")
    print(" Each session should be pinned to exactly one model.")
    print("=" * 60)

    session_results: dict[str, list[str]] = defaultdict(list)

    for s in range(num_sessions):
        session_id = f"demo-session-{s + 1:03d}"
        print(f"\n Session '{session_id}':")

        for r in range(rounds):
            resp = chat(url, session_id, PROMPTS[r % len(PROMPTS)])
            model = extract_model(resp)
            session_results[session_id].append(model)
            pinned = " [PINNED]" if r > 0 else " [FIRST — sets affinity]"
            print(f" Round {r + 1}: model = {model}{pinned}")

    print()
    print("=" * 60)
    print("Results")
    print("=" * 60)

    for session_id, models in session_results.items():
        unique_models = set(models)
        if len(unique_models) == 1:
            print(f" PASS {session_id} -> always routed to '{models[0]}'")
        else:
            print(
                f" FAIL {session_id} -> inconsistent models across rounds: {unique_models}"
            )
            all_passed = False

    print()
    if all_passed:
        print("All sessions were pinned consistently.")
        print("Redis session cache is working correctly.")
    else:
        print("One or more sessions were NOT pinned consistently.")
        print("Check that Redis is running and Plano is configured with:")
        print("  routing:")
        print("    session_cache:")
        print("      type: redis")
        print("      url: redis://localhost:6379")

    return all_passed


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--url", default=PLANO_URL, help="Plano chat completions URL")
    parser.add_argument(
        "--rounds", type=int, default=4, help="Requests per session (default 4)"
    )
    parser.add_argument(
        "--sessions", type=int, default=3, help="Number of sessions to test (default 3)"
    )
    args = parser.parse_args()

    passed = run_verification(args.url, args.rounds, args.sessions)
    sys.exit(0 if passed else 1)


if __name__ == "__main__":
    main()