Unified overrides for custom router and orchestrator models (#820)

* support configurable orchestrator model via orchestration config section

* add self-hosting docs and demo for Plano-Orchestrator

* list all Plano-Orchestrator model variants in docs

* use overrides for custom routing and orchestration model

* update docs

* update orchestrator model name

* rename arch provider to plano, use llm_routing_model and agent_orchestration_model

* regenerate rendered config reference
Adil Hafeez 2026-03-15 09:36:11 -07:00 committed by GitHub
parent 785bf7e021
commit bc059aed4d
20 changed files with 312 additions and 103 deletions


@@ -253,13 +253,11 @@ Using Ollama (recommended for local development)

 .. code-block:: yaml

-    routing:
-      model: Arch-Router
-      llm_provider: arch-router
+    overrides:
+      llm_routing_model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M

     model_providers:
-      - name: arch-router
-        model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+      - model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
         base_url: http://localhost:11434
       - model: openai/gpt-5.2
@@ -324,13 +322,11 @@ vLLM provides higher throughput and GPU optimizations suitable for production de

 .. code-block:: yaml

-    routing:
-      model: Arch-Router
-      llm_provider: arch-router
+    overrides:
+      llm_routing_model: plano/Arch-Router

     model_providers:
-      - name: arch-router
-        model: Arch-Router
+      - model: plano/Arch-Router
         base_url: http://<your-server-ip>:10000
       - model: openai/gpt-5.2


@@ -335,6 +335,90 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age

   - id: troubleshoot_agent
     description: Diagnoses and resolves technical issues step by step

Self-hosting Plano-Orchestrator
-------------------------------

By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.

.. note::

   vLLM requires a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.

The following model variants are available on HuggingFace:

* `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_ — lighter model, suitable for development and testing
* `Plano-Orchestrator-4B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-4B-FP8>`_ — FP8-quantized 4B model, lower memory usage
* `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ — full-size model for production
* `Plano-Orchestrator-30B-A3B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B-FP8>`_ — FP8-quantized 30B model, recommended for production deployments

Using vLLM
~~~~~~~~~~

1. **Install vLLM**

   .. code-block:: bash

      pip install vllm

2. **Download the model and chat template**

   .. code-block:: bash

      pip install huggingface_hub
      huggingface-cli download katanemo/Plano-Orchestrator-4B

3. **Start the vLLM server**

   For the 4B model (development):

   .. code-block:: bash

      vllm serve katanemo/Plano-Orchestrator-4B \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.3 \
        --tokenizer katanemo/Plano-Orchestrator-4B \
        --chat-template chat_template.jinja \
        --served-model-name katanemo/Plano-Orchestrator-4B \
        --enable-prefix-caching

   For the 30B-A3B-FP8 model (production):

   .. code-block:: bash

      vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.9 \
        --tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
        --chat-template chat_template.jinja \
        --max-model-len 32768 \
        --served-model-name katanemo/Plano-Orchestrator-30B-A3B-FP8 \
        --enable-prefix-caching

4. **Configure Plano to use the local orchestrator**

   Use the model name matching your ``--served-model-name``:

   .. code-block:: yaml

      overrides:
        agent_orchestration_model: plano/katanemo/Plano-Orchestrator-4B

      model_providers:
        - model: katanemo/Plano-Orchestrator-4B
          provider_interface: plano
          base_url: http://<your-server-ip>:8000

5. **Verify the server is running**

   .. code-block:: bash

      curl http://localhost:8000/health
      curl http://localhost:8000/v1/models

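The verification step above can also be scripted. A minimal sketch using only the Python standard library, assuming the default vLLM port (8000) shown in the serve commands; the function names are illustrative, not part of Plano or vLLM:

```python
import json
import urllib.error
import urllib.request


def orchestrator_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True when the vLLM /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not started, wrong port, or still loading the model weights.
        return False


def served_models(base_url: str, timeout: float = 5.0) -> list:
    """Return the model IDs listed by the OpenAI-compatible /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    base = "http://localhost:8000"  # matches the --port used above
    if orchestrator_ready(base):
        # The ID returned here should match your --served-model-name.
        print("orchestrator up, serving:", served_models(base))
    else:
        print("orchestrator not reachable at", base)
```

The ID reported by ``/v1/models`` must match the ``--served-model-name`` flag, since that is the name Plano sends in its requests.
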
Next Steps
----------


@@ -107,11 +107,11 @@ model_providers:
 - internal: true
   model: Arch-Function
   name: arch-function
-  provider_interface: arch
+  provider_interface: plano
 - internal: true
   model: Plano-Orchestrator
-  name: plano-orchestrator
-  provider_interface: arch
+  name: plano/orchestrator
+  provider_interface: plano
 prompt_targets:
 - description: Get current weather at a location.
   endpoint: