mirror of https://github.com/katanemo/plano.git, synced 2026-06-17 15:25:17 +02:00
fix nodeSelector placeholder, add provider examples and cluster check step
parent 8dc9744c84
commit 1d4d212c4c
3 changed files with 23 additions and 23 deletions
@@ -107,22 +107,23 @@ The response tells you which model would handle this request and which route was
 
 To run Arch-Router in-cluster using vLLM instead of the default hosted endpoint:
 
-**1. Update `vllm-deployment.yaml`** — set `nodeSelector` to match your GPU node's labels:
-
-```yaml
-# Examples:
-# GKE: cloud.google.com/gke-accelerator: nvidia-l4
-# EKS: eks.amazonaws.com/nodegroup: gpu-nodes
-# AKS: kubernetes.azure.com/agentpool: gpupool
-nodeSelector:
-  node.kubernetes.io/instance-type: gpu-node
-```
-
-**2. Deploy Arch-Router and Plano:**
+**0. Check your GPU node labels and taints**
+
+```bash
+kubectl get nodes --show-labels | grep -i gpu
+kubectl get node <gpu-node-name> -o jsonpath='{.spec.taints}'
+```
+
+GPU nodes commonly have a `nvidia.com/gpu:NoSchedule` taint — `vllm-deployment.yaml` includes a matching toleration. If you have multiple GPU node pools and need to pin to a specific one, uncomment and set the `nodeSelector` in `vllm-deployment.yaml` using the label for your cloud provider.
+
+**1. Deploy Arch-Router and Plano:**
 
 ```bash
+
+# arch-router deployment
 kubectl apply -f vllm-deployment.yaml
 
+# plano deployment
 kubectl create secret generic plano-secrets \
   --from-literal=OPENAI_API_KEY=$OPENAI_API_KEY \
   --from-literal=ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
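Note on the new step 0: the two `kubectl` commands can also be combined into a single pass over `kubectl get nodes -o json`. The following is a minimal sketch, not part of the commit. It assumes that GPU nodes advertise `nvidia.com/gpu` under `status.allocatable`; the function name `gpu_node_summary` and the `SAMPLE` document are hypothetical and only illustrate the shape of the output.

```python
import json


def gpu_node_summary(nodes_json: str) -> list[dict]:
    """Summarize labels and taints of nodes that advertise nvidia.com/gpu.

    Expects the text printed by `kubectl get nodes -o json`.
    """
    summary = []
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Assumption: a GPU node exposes the nvidia.com/gpu resource.
        if "nvidia.com/gpu" not in allocatable:
            continue
        taints = node.get("spec", {}).get("taints") or []
        summary.append({
            "name": node["metadata"]["name"],
            "labels": node["metadata"].get("labels", {}),
            "taints": [f'{t["key"]}:{t["effect"]}' for t in taints],
        })
    return summary


# Hypothetical two-node cluster, used only to illustrate the output.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {
                "name": "gpu-node-1",
                "labels": {"cloud.google.com/gke-accelerator": "nvidia-l4"},
            },
            "spec": {"taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}]},
            "status": {"allocatable": {"nvidia.com/gpu": "1"}},
        },
        {
            "metadata": {"name": "cpu-node-1", "labels": {}},
            "spec": {},
            "status": {"allocatable": {"cpu": "4"}},
        },
    ]
})

for entry in gpu_node_summary(SAMPLE):
    print(entry["name"], entry["taints"], sorted(entry["labels"]))
```

Against a real cluster, the same function could be fed the output of `kubectl get nodes -o json`; the labels it prints are the candidates for the optional `nodeSelector`.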
@@ -18,13 +18,13 @@ spec:
   - key: nvidia.com/gpu
     operator: Exists
     effect: NoSchedule
-nodeSelector:
-  # Replace with the label that identifies GPU nodes in your cluster
-  # Examples:
-  # GKE: cloud.google.com/gke-accelerator: nvidia-l4
-  # EKS: eks.amazonaws.com/nodegroup: gpu-nodes
-  # AKS: kubernetes.azure.com/agentpool: gpupool
-  node.kubernetes.io/instance-type: gpu-node
+# Optional: add a nodeSelector to pin to a specific GPU node pool.
+# The nvidia.com/gpu resource request below is sufficient for most clusters.
+# nodeSelector:
+# DigitalOcean: doks.digitalocean.com/gpu-model: l40s
+# GKE: cloud.google.com/gke-accelerator: nvidia-l4
+# EKS: eks.amazonaws.com/nodegroup: gpu-nodes
+# AKS: kubernetes.azure.com/agentpool: gpupool
 initContainers:
   - name: download-model
     image: python:3.11-slim
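Note on the toleration kept as context in this hunk: the sketch below illustrates why `operator: Exists` on `nvidia.com/gpu` is enough to tolerate the common GPU taint regardless of the taint's value. It is a simplified model of the Kubernetes matching rules written for illustration, not code from this repository or from Kubernetes itself; `tolerates` and `GPU_TOLERATION` are hypothetical names.

```python
# The toleration from vllm-deployment.yaml, expressed as a plain dict.
GPU_TOLERATION = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An empty key matches every taint key (Kubernetes only accepts an
    # empty key together with operator Exists).
    key = toleration.get("key")
    if key and key != taint.get("key"):
        return False
    # Exists ignores the taint value; Equal (the default) compares it.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
print(tolerates(GPU_TOLERATION, gpu_taint))
```

A toleration only allows scheduling onto the tainted node; it does not attract the pod to it. That is why the manifest relies on the `nvidia.com/gpu` resource request for placement and keeps `nodeSelector` as an optional extra.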
@@ -362,10 +362,9 @@ The ``demos/llm_routing/model_routing_service/`` directory includes ready-to-use
 Key things to know before deploying:
 
 - GPU nodes commonly have a ``nvidia.com/gpu:NoSchedule`` taint — the ``vllm-deployment.yaml``
-  includes a matching toleration. Update the ``nodeSelector`` to match your cluster's GPU node
-  labels (GKE, EKS, AKS each use different label keys).
-- The ``nvidia.com/gpu: "1"`` resource request alone is sufficient for scheduling, but a
-  ``nodeSelector`` is recommended when you have mixed node pools.
+  includes a matching toleration. The ``nvidia.com/gpu: "1"`` resource request is sufficient
+  for scheduling in most clusters; a ``nodeSelector`` is optional and commented out in the
+  manifest for cases where you need to pin to a specific GPU node pool.
 - Model download takes ~1 minute; vLLM loads the model in ~1-2 minutes after that. The
   ``livenessProbe`` has a 180-second ``initialDelaySeconds`` to avoid premature restarts.
 - The Plano config ConfigMap must use ``--from-file=plano_config.yaml=config_k8s.yaml`` with
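Note on the timing bullet kept as context above: the figures can be checked with a little arithmetic. This is a rough sketch under two assumptions: the model download runs in the `download-model` init container (as the manifest hunk suggests), and Kubernetes counts `initialDelaySeconds` from the start of the main container, so only the vLLM load time counts against the delay. The helper `probe_headroom` is a hypothetical name.

```python
INITIAL_DELAY_SECONDS = 180       # livenessProbe initialDelaySeconds from the docs
VLLM_LOAD_SECONDS = (60, 120)     # "~1-2 minutes" to load the model


def probe_headroom(initial_delay: int, slowest_load: int) -> int:
    """Seconds left between the slowest expected load and the first probe."""
    return initial_delay - slowest_load


headroom = probe_headroom(INITIAL_DELAY_SECONDS, max(VLLM_LOAD_SECONDS))
print(f"headroom before the first liveness probe: {headroom}s")
```

With the documented estimates this leaves about a minute of slack; on slower storage or with larger models the delay may need to be raised accordingly.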