deploy: 5388c6777f

2026-06-05 14:45:15 +02:00 · 2026-03-16 19:06:02 +00:00 · 2026-03-16 19:06:02 +00:00 · c042debbfa
commit c042debbfa
parent 498b2615d6
33 changed files with 92 additions and 33 deletions
--- a/includes/llms.txt
+++ b/includes/llms.txt
@@ -1,6 +1,6 @@
 Plano Docs v0.4.12
 llms.txt (auto-generated)
-Generated (UTC): 2026-03-15T20:04:02.309985+00:00
+Generated (UTC): 2026-03-16T19:05:58.621874+00:00

 Table of contents
 - Agents (concepts/agents)
@@ -3855,6 +3855,37 @@ Verify the server is running
 curl http://localhost:10000/health
 curl http://localhost:10000/v1/models

+Using vLLM on Kubernetes (GPU nodes)
+
+For teams running Kubernetes, Arch-Router and Plano can be deployed as in-cluster services.
+The demos/llm_routing/model_routing_service/ directory includes ready-to-use manifests:
+
+vllm-deployment.yaml — Arch-Router served by vLLM, with an init container to download
+the model from HuggingFace
+
+plano-deployment.yaml — Plano proxy configured to use the in-cluster Arch-Router
+
+config_k8s.yaml — Plano config with llm_routing_model pointing at
+http://arch-router:10000 instead of the default hosted endpoint
+
+Key things to know before deploying:
+
+GPU nodes commonly carry an nvidia.com/gpu:NoSchedule taint, so vllm-deployment.yaml
+includes a matching toleration. The nvidia.com/gpu: "1" resource request is sufficient
+for scheduling in most clusters; a nodeSelector is optional and is commented out in the
+manifest for cases where you need to pin to a specific GPU node pool.
+
+Model download takes ~1 minute; vLLM loads the model in ~1-2 minutes after that. The
+livenessProbe has a 180-second initialDelaySeconds to avoid premature restarts.
+
+Create the Plano config ConfigMap with --from-file=plano_config.yaml=config_k8s.yaml and
+mount it in the Deployment with subPath; omitting subPath causes Kubernetes to mount a
+directory instead of a file.
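+
+The three points above can be sketched as manifest fragments. This is an illustrative
+outline under stated assumptions, not the contents of the demo manifests: the container,
+volume, and ConfigMap names, the nodeSelector label, the probe path, port, and period,
+and the mountPath are hypothetical placeholders. Only the toleration, the GPU count, the
+180-second delay, and the plano_config.yaml key come from the text.
+
+```yaml
+# Fragment 1: Pod template spec for the vLLM Deployment (names are assumptions).
+tolerations:
+  - key: nvidia.com/gpu            # tolerate the taint commonly set on GPU nodes
+    operator: Exists
+    effect: NoSchedule
+# nodeSelector:                    # optional: pin to a specific GPU node pool
+#   example.com/gpu-pool: my-pool  # hypothetical label
+containers:
+  - name: arch-router              # hypothetical container name
+    resources:
+      limits:
+        nvidia.com/gpu: "1"        # the request defaults to the limit
+    livenessProbe:
+      httpGet:
+        path: /health              # assumed: same endpoint as the curl check
+        port: 10000                # assumed: port from http://arch-router:10000
+      initialDelaySeconds: 180     # give vLLM time to load the model first
+      periodSeconds: 10            # hypothetical value
+---
+# Fragment 2: Pod template spec for the Plano Deployment (names are assumptions).
+containers:
+  - name: plano                    # hypothetical container name
+    volumeMounts:
+      - name: plano-config
+        mountPath: /app/plano_config.yaml  # hypothetical path in the container
+        subPath: plano_config.yaml         # key set by --from-file=plano_config.yaml=...
+volumes:
+  - name: plano-config
+    configMap:
+      name: plano-config           # hypothetical ConfigMap name
+```
+
+One trade-off of subPath is that a container mounting a ConfigMap this way does not pick
+up later ConfigMap updates, so the Pod has to be restarted after the config changes.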
+
+For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YAML), see
+deployment. For full step-by-step commands specific to this demo, see the
+demo README.
+
 Combining Routing Methods

 You can combine static model selection with dynamic routing preferences for maximum flexibility: