diff --git a/demos/llm_routing/model_routing_service/README.md b/demos/llm_routing/model_routing_service/README.md
index 676de9e1..72b672f3 100644
--- a/demos/llm_routing/model_routing_service/README.md
+++ b/demos/llm_routing/model_routing_service/README.md
@@ -107,22 +107,23 @@ The response tells you which model would handle this request and which route was
 
 To run Arch-Router in-cluster using vLLM instead of the default hosted endpoint:
 
-**1. Update `vllm-deployment.yaml`** — set `nodeSelector` to match your GPU node's labels:
-
-```yaml
-# Examples:
-# GKE: cloud.google.com/gke-accelerator: nvidia-l4
-# EKS: eks.amazonaws.com/nodegroup: gpu-nodes
-# AKS: kubernetes.azure.com/agentpool: gpupool
-nodeSelector:
-  node.kubernetes.io/instance-type: gpu-node
-```
-
-**2. Deploy Arch-Router and Plano:**
+**0. Check your GPU node labels and taints**
 
 ```bash
+kubectl get nodes --show-labels | grep -i gpu
+kubectl get node -o jsonpath='{.spec.taints}'
+```
+
+GPU nodes commonly have a `nvidia.com/gpu:NoSchedule` taint — `vllm-deployment.yaml` includes a matching toleration. If you have multiple GPU node pools and need to pin to a specific one, uncomment and set the `nodeSelector` in `vllm-deployment.yaml` using the label for your cloud provider.
+
+**1. Deploy Arch-Router and Plano:**
+
+```bash
+
+# arch-router deployment
 kubectl apply -f vllm-deployment.yaml
+# plano deployment
 kubectl create secret generic plano-secrets \
   --from-literal=OPENAI_API_KEY=$OPENAI_API_KEY \
   --from-literal=ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
diff --git a/demos/llm_routing/model_routing_service/vllm-deployment.yaml b/demos/llm_routing/model_routing_service/vllm-deployment.yaml
index a2f40cf2..1debe15e 100644
--- a/demos/llm_routing/model_routing_service/vllm-deployment.yaml
+++ b/demos/llm_routing/model_routing_service/vllm-deployment.yaml
@@ -18,13 +18,13 @@ spec:
       - key: nvidia.com/gpu
         operator: Exists
         effect: NoSchedule
-      nodeSelector:
-        # Replace with the label that identifies GPU nodes in your cluster
-        # Examples:
-        # GKE: cloud.google.com/gke-accelerator: nvidia-l4
-        # EKS: eks.amazonaws.com/nodegroup: gpu-nodes
-        # AKS: kubernetes.azure.com/agentpool: gpupool
-        node.kubernetes.io/instance-type: gpu-node
+      # Optional: add a nodeSelector to pin to a specific GPU node pool.
+      # The nvidia.com/gpu resource request below is sufficient for most clusters.
+      # nodeSelector:
+      #   DigitalOcean: doks.digitalocean.com/gpu-model: l40s
+      #   GKE: cloud.google.com/gke-accelerator: nvidia-l4
+      #   EKS: eks.amazonaws.com/nodegroup: gpu-nodes
+      #   AKS: kubernetes.azure.com/agentpool: gpupool
       initContainers:
       - name: download-model
         image: python:3.11-slim
diff --git a/docs/source/guides/llm_router.rst b/docs/source/guides/llm_router.rst
index 2fceb112..7c4ad685 100644
--- a/docs/source/guides/llm_router.rst
+++ b/docs/source/guides/llm_router.rst
@@ -362,10 +362,9 @@ The ``demos/llm_routing/model_routing_service/`` directory includes ready-to-use
 Key things to know before deploying:
 
 - GPU nodes commonly have a ``nvidia.com/gpu:NoSchedule`` taint — the ``vllm-deployment.yaml``
-  includes a matching toleration. Update the ``nodeSelector`` to match your cluster's GPU node
-  labels (GKE, EKS, AKS each use different label keys).
-- The ``nvidia.com/gpu: "1"`` resource request alone is sufficient for scheduling, but a
-  ``nodeSelector`` is recommended when you have mixed node pools.
+  includes a matching toleration. The ``nvidia.com/gpu: "1"`` resource request is sufficient
+  for scheduling in most clusters; a ``nodeSelector`` is optional and commented out in the
+  manifest for cases where you need to pin to a specific GPU node pool.
 - Model download takes ~1 minute; vLLM loads the model in ~1-2 minutes after that. The
   ``livenessProbe`` has a 180-second ``initialDelaySeconds`` to avoid premature restarts.
 - The Plano config ConfigMap must use ``--from-file=plano_config.yaml=config_k8s.yaml`` with
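The new step 0 in the README checks labels and taints before anything is scheduled; once `vllm-deployment.yaml` is applied, it can also help to confirm where the pod actually landed and that the GPU was granted. A minimal sketch, assuming the Deployment carries an `app=arch-router` label (substitute whatever selector the manifest actually defines):

```bash
# Sketch: confirm the vLLM pod scheduled onto a GPU node and was allocated the GPU.
# The app=arch-router label is an assumption; use the labels from vllm-deployment.yaml.
kubectl get pods -l app=arch-router -o wide                      # NODE column shows placement
NODE=$(kubectl get pods -l app=arch-router -o jsonpath='{.items[0].spec.nodeName}')
kubectl describe node "$NODE" | grep -A 8 "Allocated resources"  # nvidia.com/gpu should be non-zero
```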
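Because model download and load take a few minutes (hence the 180-second `initialDelaySeconds` called out in the guide), waiting for the rollout and then querying vLLM's OpenAI-compatible API is a reasonable smoke test. A sketch, assuming the Deployment is named `arch-router` and vLLM listens on its default port 8000:

```bash
# Sketch: wait for vLLM to come up, then hit its OpenAI-compatible endpoint.
# "arch-router" as the Deployment name and port 8000 are assumptions.
kubectl rollout status deployment/arch-router --timeout=10m
kubectl port-forward deployment/arch-router 8000:8000 &
sleep 2
curl -s http://localhost:8000/v1/models   # should list the locally served Arch-Router model
kill $!                                    # stop the port-forward when done
```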