add k8s deployment manifests and docs for self-hosted Arch-Router

2026-05-30 14:25:15 +02:00 · 2026-03-16 11:08:07 -07:00 · 2026-03-16 11:08:07 -07:00 · 5b58bb60c3
commit 5b58bb60c3
parent f1b8c03e2f
7 changed files with 381 additions and 342 deletions
--- a/demos/llm_routing/model_routing_service/README.md
+++ b/demos/llm_routing/model_routing_service/README.md
@ -1,6 +1,54 @@
 # Model Routing Service Demo

-This demo shows how to use the `/routing/v1/*` endpoints to get routing decisions without proxying requests to an LLM. The endpoint accepts standard LLM request formats and returns which model Plano's router would select.
+Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and intelligent LLM routing.
+
+```
+┌───────────┐      ┌─────────────────────────────────┐      ┌──────────────┐
+│  Client   │ ───► │  Plano                          │ ───► │  OpenAI      │
+│  (any     │      │                                 │      │  Anthropic   │
+│  language)│      │  Arch-Router (1.5B model)       │      │  Any Provider│
+└───────────┘      │  analyzes intent → picks model  │      └──────────────┘
+                   └─────────────────────────────────┘
+```
+
+- **One endpoint, many models** — apps call Plano using standard OpenAI/Anthropic APIs; Plano handles provider selection, keys, and failover
+- **Intelligent routing** — a lightweight 1.5B router model classifies user intent and picks the best model per request
+- **Platform governance** — centralize API keys, rate limits, guardrails, and observability without touching app code
+- **Runs anywhere** — single binary; self-host the router for full data privacy
+
+## How Routing Works
+
+The entire routing configuration is plain YAML — no code:
+
+```yaml
+model_providers:
+  - model: openai/gpt-4o-mini
+    default: true                    # fallback for unmatched requests
+
+  - model: openai/gpt-4o
+    routing_preferences:
+      - name: complex_reasoning
+        description: complex reasoning tasks, multi-step analysis
+
+  - model: anthropic/claude-sonnet-4-20250514
+    routing_preferences:
+      - name: code_generation
+        description: generating new code, writing functions
+```
+
+When a request arrives, Plano sends the conversation and routing preferences to Arch-Router, which classifies the intent and returns the matching route:
+
+```
+1. Request arrives          → "Write binary search in Python"
+2. Preferences serialized   → [{"name":"code_generation", ...}, {"name":"complex_reasoning", ...}]
+3. Arch-Router classifies   → {"route": "code_generation"}
+4. Route → Model lookup     → code_generation → anthropic/claude-sonnet-4-20250514
+5. Request forwarded        → Claude generates the response
+```
+
+No match? Arch-Router returns `other` → Plano falls back to the default model.
+
+The `/routing/v1/*` endpoints return the routing decision **without** forwarding to the LLM — useful for testing and validating routing behavior before going to production.

 ## Setup

@ -55,6 +103,68 @@ Response:

 The response tells you which model would handle this request and which route was matched, without actually making the LLM call.

+## Kubernetes Deployment (Self-hosted Arch-Router on GPU)
+
+To run Arch-Router in-cluster using vLLM instead of the default hosted endpoint:
+
+**1. Update `vllm-deployment.yaml`** — set `nodeSelector` to match your GPU node's labels:
+
+```yaml
+# Examples:
+#   GKE: cloud.google.com/gke-accelerator: nvidia-l4
+#   EKS: eks.amazonaws.com/nodegroup: gpu-nodes
+#   AKS: kubernetes.azure.com/agentpool: gpupool
+nodeSelector:
+  node.kubernetes.io/instance-type: gpu-node
+```
+
+**2. Deploy Arch-Router and Plano:**
+
+```bash
+kubectl apply -f vllm-deployment.yaml
+
+kubectl create secret generic plano-secrets \
+  --from-literal=OPENAI_API_KEY=$OPENAI_API_KEY \
+  --from-literal=ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
+
+kubectl create configmap plano-config \
+  --from-file=plano_config.yaml=config_k8s.yaml \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+kubectl apply -f plano-deployment.yaml
+```
+
+**3. Wait for both pods to be ready:**
+
+```bash
+# Arch-Router downloads the model (~1 min) then vLLM loads it (~2 min)
+kubectl get pods -l app=arch-router -w
+kubectl rollout status deployment/plano
+```
+
+**4. Test:**
+
+```bash
+kubectl port-forward svc/plano 12000:12000
+./demo.sh
+```
+
+To confirm requests are hitting your in-cluster Arch-Router (not just health checks):
+
+```bash
+kubectl logs -l app=arch-router -f --tail=0
+# Look for POST /v1/chat/completions entries
+```
+
+**Updating the config:**
+
+```bash
+kubectl create configmap plano-config \
+  --from-file=plano_config.yaml=config_k8s.yaml \
+  --dry-run=client -o yaml | kubectl apply -f -
+kubectl rollout restart deployment/plano
+```
+
 ## Demo Output

 ```