resolve merge conflict in main.rs

Adil Hafeez 2026-03-16 12:40:33 -07:00
commit 80dfb41cad
40 changed files with 920 additions and 301 deletions


@ -17,7 +17,7 @@ from sphinxawesome_theme.postprocess import Icons
project = "Plano Docs"
copyright = "2025, Katanemo Labs, Inc"
author = "Katanemo Labs, Inc"
-release = " v0.4.11"
+release = " v0.4.12"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration


@ -43,7 +43,7 @@ Plano's CLI allows you to manage and interact with the Plano efficiently. To ins
.. code-block:: console
-$ uv tool install planoai==0.4.11
+$ uv tool install planoai==0.4.12
**Option 2: Install with pip (Traditional)**
@ -51,7 +51,7 @@ Plano's CLI allows you to manage and interact with the Plano efficiently. To ins
$ python -m venv venv
$ source venv/bin/activate # On Windows, use: venv\Scripts\activate
-$ pip install planoai==0.4.11
+$ pip install planoai==0.4.12
.. _llm_routing_quickstart:


@ -253,13 +253,11 @@ Using Ollama (recommended for local development)
.. code-block:: yaml
-routing:
-model: Arch-Router
-llm_provider: arch-router
+overrides:
+llm_routing_model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
model_providers:
-- name: arch-router
-model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+- model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
base_url: http://localhost:11434
- model: openai/gpt-5.2
@ -324,13 +322,11 @@ vLLM provides higher throughput and GPU optimizations suitable for production de
.. code-block:: yaml
-routing:
-model: Arch-Router
-llm_provider: arch-router
+overrides:
+llm_routing_model: plano/Arch-Router
model_providers:
-- name: arch-router
-model: Arch-Router
+- model: plano/Arch-Router
base_url: http://<your-server-ip>:10000
- model: openai/gpt-5.2
@ -351,6 +347,35 @@ vLLM provides higher throughput and GPU optimizations suitable for production de
curl http://localhost:10000/v1/models
Using vLLM on Kubernetes (GPU nodes)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For teams running Kubernetes, Arch-Router and Plano can be deployed as in-cluster services.
The ``demos/llm_routing/model_routing_service/`` directory includes ready-to-use manifests:
- ``vllm-deployment.yaml`` — Arch-Router served by vLLM, with an init container to download
the model from HuggingFace
- ``plano-deployment.yaml`` — Plano proxy configured to use the in-cluster Arch-Router
- ``config_k8s.yaml`` — Plano config with ``llm_routing_model`` pointing at
``http://arch-router:10000`` instead of the default hosted endpoint
Key things to know before deploying:
- GPU nodes commonly have a ``nvidia.com/gpu:NoSchedule`` taint — the ``vllm-deployment.yaml``
includes a matching toleration. The ``nvidia.com/gpu: "1"`` resource request is sufficient
for scheduling in most clusters; a ``nodeSelector`` is optional and commented out in the
manifest for cases where you need to pin to a specific GPU node pool.
- Model download takes ~1 minute; vLLM loads the model in ~1-2 minutes after that. The
``livenessProbe`` has a 180-second ``initialDelaySeconds`` to avoid premature restarts.
- The Plano config ConfigMap must use ``--from-file=plano_config.yaml=config_k8s.yaml`` with
``subPath`` in the Deployment — omitting ``subPath`` causes Kubernetes to mount a directory
instead of a file.
For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YAML), see
:ref:`deployment`. For full step-by-step commands specific to this demo, see the
`demo README <https://github.com/katanemo/plano/tree/main/demos/llm_routing/model_routing_service/README.md>`_.
Combining Routing Methods
-------------------------


@ -335,6 +335,90 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age
- id: troubleshoot_agent
description: Diagnoses and resolves technical issues step by step
Self-hosting Plano-Orchestrator
-------------------------------
By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.
.. note::
vLLM requires a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.
The following model variants are available on HuggingFace:
* `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_ — lighter model, suitable for development and testing
* `Plano-Orchestrator-4B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-4B-FP8>`_ — FP8 quantized 4B model, lower memory usage
* `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ — full-size model for production
* `Plano-Orchestrator-30B-A3B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B-FP8>`_ — FP8 quantized 30B model, recommended for production deployments
Using vLLM
~~~~~~~~~~
1. **Install vLLM**
.. code-block:: bash
pip install vllm
2. **Download the model and chat template**
.. code-block:: bash
pip install huggingface_hub
huggingface-cli download katanemo/Plano-Orchestrator-4B
3. **Start the vLLM server**
For the 4B model (development):
.. code-block:: bash
vllm serve katanemo/Plano-Orchestrator-4B \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.3 \
--tokenizer katanemo/Plano-Orchestrator-4B \
--chat-template chat_template.jinja \
--served-model-name katanemo/Plano-Orchestrator-4B \
--enable-prefix-caching
For the 30B-A3B-FP8 model (production):
.. code-block:: bash
vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
--chat-template chat_template.jinja \
--max-model-len 32768 \
--served-model-name katanemo/Plano-Orchestrator-30B-A3B-FP8 \
--enable-prefix-caching
4. **Configure Plano to use the local orchestrator**
Use the model name matching your ``--served-model-name``:
.. code-block:: yaml
overrides:
agent_orchestration_model: plano/katanemo/Plano-Orchestrator-4B
model_providers:
- model: katanemo/Plano-Orchestrator-4B
provider_interface: plano
base_url: http://<your-server-ip>:8000
5. **Verify the server is running**
.. code-block:: bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
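The two ``curl`` checks above can also be scripted. The following is a minimal Python sketch, not part of Plano or vLLM: the helper names ``served_model_ids`` and ``check_server`` are hypothetical, and it assumes the server answers ``/v1/models`` with the OpenAI-style model list (a JSON object whose ``data`` array holds entries with an ``id`` field).

```python
import json
import urllib.request


def served_model_ids(models_json: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body.

    Assumption: the body looks like {"data": [{"id": "..."}, ...]}.
    """
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(base_url: str, expected_model: str, timeout: float = 5.0) -> bool:
    """Return True if /health answers 200 and expected_model is listed.

    Hypothetical helper. An unreachable server or an HTTP error status
    surfaces as urllib.error.URLError (or its subclass HTTPError), which
    is left for the caller to handle.
    """
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        if resp.status != 200:
            return False
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return expected_model in served_model_ids(body)
```

For the 4B example, calling ``check_server("http://localhost:8000", "katanemo/Plano-Orchestrator-4B")`` would confirm that the id reported by the server is the same value passed to ``--served-model-name``, which is the name the Plano config in step 4 has to match.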
Next Steps
----------


@ -65,7 +65,7 @@ Create a ``docker-compose.yml`` file with the following configuration:
# docker-compose.yml
services:
plano:
-image: katanemo/plano:0.4.11
+image: katanemo/plano:0.4.12
container_name: plano
ports:
- "10000:10000" # ingress (client -> plano)
@ -153,7 +153,7 @@ Create a ``plano-deployment.yaml``:
spec:
containers:
- name: plano
-image: katanemo/plano:0.4.11
+image: katanemo/plano:0.4.12
ports:
- containerPort: 12000 # LLM gateway (chat completions, model routing)
name: llm-gateway


@ -107,11 +107,11 @@ model_providers:
- internal: true
model: Arch-Function
name: arch-function
-provider_interface: arch
+provider_interface: plano
- internal: true
model: Plano-Orchestrator
-name: plano-orchestrator
-provider_interface: arch
+name: plano/orchestrator
+provider_interface: plano
prompt_targets:
- description: Get current weather at a location.
endpoint: