resolve merge conflict in main.rs

2026-05-21 13:55:15 +02:00 · 2026-03-16 12:40:33 -07:00 · 80dfb41cad
commit 80dfb41cad
parent 6fe7613bcd 5388c6777f
40 changed files with 920 additions and 301 deletions
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -17,7 +17,7 @@ from sphinxawesome_theme.postprocess import Icons
 project = "Plano Docs"
 copyright = "2025, Katanemo Labs, Inc"
 author = "Katanemo Labs, Inc"
-release = " v0.4.11"
+release = " v0.4.12"

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
--- a/docs/source/get_started/quickstart.rst
+++ b/docs/source/get_started/quickstart.rst
@@ -43,7 +43,7 @@ Plano's CLI allows you to manage and interact with the Plano efficiently. To ins

 .. code-block:: console

-   $ uv tool install planoai==0.4.11
+   $ uv tool install planoai==0.4.12

 **Option 2: Install with pip (Traditional)**

@@ -51,7 +51,7 @@ Plano's CLI allows you to manage and interact with the Plano efficiently. To ins

   $ python -m venv venv
   $ source venv/bin/activate   # On Windows, use: venv\Scripts\activate
-   $ pip install planoai==0.4.11
+   $ pip install planoai==0.4.12


 .. _llm_routing_quickstart:
--- a/docs/source/guides/llm_router.rst
+++ b/docs/source/guides/llm_router.rst
@@ -253,13 +253,11 @@ Using Ollama (recommended for local development)

   .. code-block:: yaml

-       routing:
-         model: Arch-Router
-         llm_provider: arch-router
+       overrides:
+         llm_routing_model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M

       model_providers:
-         - name: arch-router
-           model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+         - model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
           base_url: http://localhost:11434

         - model: openai/gpt-5.2
@@ -324,13 +322,11 @@ vLLM provides higher throughput and GPU optimizations suitable for production de

   .. code-block:: yaml

-       routing:
-         model: Arch-Router
-         llm_provider: arch-router
+       overrides:
+         llm_routing_model: plano/Arch-Router

       model_providers:
-         - name: arch-router
-           model: Arch-Router
+         - model: plano/Arch-Router
           base_url: http://<your-server-ip>:10000

         - model: openai/gpt-5.2
@@ -351,6 +347,35 @@ vLLM provides higher throughput and GPU optimizations suitable for production de
       curl http://localhost:10000/v1/models


+Using vLLM on Kubernetes (GPU nodes)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For teams running Kubernetes, Arch-Router and Plano can be deployed as in-cluster services.
+The ``demos/llm_routing/model_routing_service/`` directory includes ready-to-use manifests:
+
+- ``vllm-deployment.yaml`` — Arch-Router served by vLLM, with an init container to download
+  the model from HuggingFace
+- ``plano-deployment.yaml`` — Plano proxy configured to use the in-cluster Arch-Router
+- ``config_k8s.yaml`` — Plano config with ``llm_routing_model`` pointing at
+  ``http://arch-router:10000`` instead of the default hosted endpoint
+
+Key things to know before deploying:
+
+- GPU nodes commonly have a ``nvidia.com/gpu:NoSchedule`` taint — the ``vllm-deployment.yaml``
+  includes a matching toleration. The ``nvidia.com/gpu: "1"`` resource request is sufficient
+  for scheduling in most clusters; a ``nodeSelector`` is optional and commented out in the
+  manifest for cases where you need to pin to a specific GPU node pool.
+- Model download takes ~1 minute; vLLM loads the model in ~1-2 minutes after that. The
+  ``livenessProbe`` has a 180-second ``initialDelaySeconds`` to avoid premature restarts.
+- Create the Plano config ConfigMap with ``--from-file=plano_config.yaml=config_k8s.yaml`` and
+  mount it in the Deployment with ``subPath``. Without ``subPath``, Kubernetes mounts the
+  ConfigMap as a directory instead of a single file.
+
+For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YAML), see
+:ref:`deployment`. For full step-by-step commands specific to this demo, see the
+`demo README <https://github.com/katanemo/plano/tree/main/demos/llm_routing/model_routing_service/README.md>`_.
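The ``subPath`` point in the list above can be made concrete with a small sketch. It builds the volume and volume-mount fragment of a Deployment as plain Python data, so the role of ``subPath`` is visible. The ConfigMap name ``plano-config`` and the mount path ``/app/plano_config.yaml`` are assumptions chosen for illustration and are not taken from the demo manifests. Only the key ``plano_config.yaml`` follows from the ``--from-file`` argument quoted above.

```python
import json


def config_volume_fragment(configmap_name, key, mount_path):
    """Build the pod-spec pieces that mount one ConfigMap key as a single file.

    With subPath set to the ConfigMap key, the key's content appears as a
    file at mount_path. Without subPath, mount_path would be a directory
    containing one file per key.
    """
    volume_name = "plano-config"  # hypothetical volume name
    volume = {"name": volume_name, "configMap": {"name": configmap_name}}
    mount = {"name": volume_name, "mountPath": mount_path, "subPath": key}
    return {"volumes": [volume], "volumeMounts": [mount]}


# Hypothetical names and paths; adjust them to the actual manifest.
fragment = config_volume_fragment(
    configmap_name="plano-config",
    key="plano_config.yaml",
    mount_path="/app/plano_config.yaml",
)
print(json.dumps(fragment, indent=2))
```

The printed structure corresponds to the ``volumes`` and ``volumeMounts`` entries of a pod spec. The manifests in the demo directory remain the authoritative version.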
+
+
 Combining Routing Methods
 -------------------------

--- a/docs/source/guides/orchestration.rst
+++ b/docs/source/guides/orchestration.rst
@@ -335,6 +335,90 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age
      - id: troubleshoot_agent
        description: Diagnoses and resolves technical issues step by step

+Self-hosting Plano-Orchestrator
+-------------------------------
+
+By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.
+
+.. note::
+   vLLM requires a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.
+
+The following model variants are available on HuggingFace:
+
+* `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_ — lighter model, suitable for development and testing
+* `Plano-Orchestrator-4B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-4B-FP8>`_ — FP8 quantized 4B model, lower memory usage
+* `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ — full-size model for production
+* `Plano-Orchestrator-30B-A3B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B-FP8>`_ — FP8 quantized 30B model, recommended for production deployments
+
+Using vLLM
+~~~~~~~~~~
+
+1. **Install vLLM**
+
+   .. code-block:: bash
+
+       pip install vllm
+
+2. **Download the model and chat template**
+
+   .. code-block:: bash
+
+       pip install huggingface_hub
+       huggingface-cli download katanemo/Plano-Orchestrator-4B
+
+3. **Start the vLLM server**
+
+   For the 4B model (development):
+
+   .. code-block:: bash
+
+       vllm serve katanemo/Plano-Orchestrator-4B \
+           --host 0.0.0.0 \
+           --port 8000 \
+           --tensor-parallel-size 1 \
+           --gpu-memory-utilization 0.3 \
+           --tokenizer katanemo/Plano-Orchestrator-4B \
+           --chat-template chat_template.jinja \
+           --served-model-name katanemo/Plano-Orchestrator-4B \
+           --enable-prefix-caching
+
+   For the 30B-A3B-FP8 model (production):
+
+   .. code-block:: bash
+
+       vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
+           --host 0.0.0.0 \
+           --port 8000 \
+           --tensor-parallel-size 1 \
+           --gpu-memory-utilization 0.9 \
+           --tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
+           --chat-template chat_template.jinja \
+           --max-model-len 32768 \
+           --served-model-name katanemo/Plano-Orchestrator-30B-A3B-FP8 \
+           --enable-prefix-caching
+
+4. **Configure Plano to use the local orchestrator**
+
+   Use the model name matching your ``--served-model-name``:
+
+   .. code-block:: yaml
+
+       overrides:
+         agent_orchestration_model: plano/katanemo/Plano-Orchestrator-4B
+
+       model_providers:
+         - model: katanemo/Plano-Orchestrator-4B
+           provider_interface: plano
+           base_url: http://<your-server-ip>:8000
+
+5. **Verify the server is running**
+
+   .. code-block:: bash
+
+       curl http://localhost:8000/health
+       curl http://localhost:8000/v1/models
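Steps 4 and 5 depend on the name passed to ``--served-model-name`` matching what the server reports. The check can be scripted. The following is a minimal sketch, assuming the ``/v1/models`` response uses the OpenAI-style shape ``{"data": [{"id": ...}]}``. The helper names ``fetch_json``, ``served_model_ids``, and ``check_served_model`` are hypothetical and not part of Plano or vLLM.

```python
import json
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def served_model_ids(models_payload):
    """Extract the model ids from an OpenAI-style /v1/models payload."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_served_model(base_url, expected_name, fetch=fetch_json):
    """Return True if expected_name is among the models the server reports.

    The fetch argument is injectable so the check can be exercised
    without a running server.
    """
    payload = fetch(base_url.rstrip("/") + "/v1/models")
    return expected_name in served_model_ids(payload)
```

For example, ``check_served_model("http://localhost:8000", "katanemo/Plano-Orchestrator-4B")`` should return ``True`` once the 4B server from step 3 is up. A ``False`` result suggests the name in the Plano config and the ``--served-model-name`` value have drifted apart.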
+
+
 Next Steps
 ----------

--- a/docs/source/resources/deployment.rst
+++ b/docs/source/resources/deployment.rst
@@ -65,7 +65,7 @@ Create a ``docker-compose.yml`` file with the following configuration:
   # docker-compose.yml
   services:
     plano:
-       image: katanemo/plano:0.4.11
+       image: katanemo/plano:0.4.12
       container_name: plano
       ports:
         - "10000:10000" # ingress (client -> plano)
@@ -153,7 +153,7 @@ Create a ``plano-deployment.yaml``:
       spec:
         containers:
           - name: plano
-             image: katanemo/plano:0.4.11
+             image: katanemo/plano:0.4.12
             ports:
               - containerPort: 12000  # LLM gateway (chat completions, model routing)
                 name: llm-gateway
--- a/docs/source/resources/includes/plano_config_full_reference_rendered.yaml
+++ b/docs/source/resources/includes/plano_config_full_reference_rendered.yaml
@@ -107,11 +107,11 @@ model_providers:
 - internal: true
  model: Arch-Function
  name: arch-function
-  provider_interface: arch
+  provider_interface: plano
 - internal: true
  model: Plano-Orchestrator
-  name: plano-orchestrator
-  provider_interface: arch
+  name: plano/orchestrator
+  provider_interface: plano
 prompt_targets:
 - description: Get current weather at a location.
  endpoint: