mirror of
https://github.com/katanemo/plano.git
synced 2026-05-21 13:55:15 +02:00
resolve merge conflict in main.rs
commit
80dfb41cad
40 changed files with 920 additions and 301 deletions
|
|
@ -17,7 +17,7 @@ from sphinxawesome_theme.postprocess import Icons
|
|||
project = "Plano Docs"
|
||||
copyright = "2025, Katanemo Labs, Inc"
|
||||
author = "Katanemo Labs, Inc"
|
||||
release = " v0.4.11"
|
||||
release = " v0.4.12"
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
|
||||
|
|
|
|||
|
|
@ -43,7 +43,7 @@ Plano's CLI allows you to manage and interact with the Plano efficiently. To ins
|
|||
|
||||
.. code-block:: console
|
||||
|
||||
$ uv tool install planoai==0.4.11
|
||||
$ uv tool install planoai==0.4.12
|
||||
|
||||
**Option 2: Install with pip (Traditional)**
|
||||
|
||||
|
|
@ -51,7 +51,7 @@ Plano's CLI allows you to manage and interact with the Plano efficiently. To ins
|
|||
|
||||
$ python -m venv venv
|
||||
$ source venv/bin/activate # On Windows, use: venv\Scripts\activate
|
||||
$ pip install planoai==0.4.11
|
||||
$ pip install planoai==0.4.12
|
||||
|
||||
|
||||
.. _llm_routing_quickstart:
|
||||
|
|
|
|||
|
|
@ -253,13 +253,11 @@ Using Ollama (recommended for local development)
|
|||
|
||||
.. code-block:: yaml
|
||||
|
||||
routing:
|
||||
model: Arch-Router
|
||||
llm_provider: arch-router
|
||||
overrides:
|
||||
llm_routing_model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
|
||||
|
||||
model_providers:
|
||||
- name: arch-router
|
||||
model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
|
||||
- model: plano/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
|
||||
base_url: http://localhost:11434
|
||||
|
||||
- model: openai/gpt-5.2
|
||||
|
|
@ -324,13 +322,11 @@ vLLM provides higher throughput and GPU optimizations suitable for production de
|
|||
|
||||
.. code-block:: yaml
|
||||
|
||||
routing:
|
||||
model: Arch-Router
|
||||
llm_provider: arch-router
|
||||
overrides:
|
||||
llm_routing_model: plano/Arch-Router
|
||||
|
||||
model_providers:
|
||||
- name: arch-router
|
||||
model: Arch-Router
|
||||
- model: plano/Arch-Router
|
||||
base_url: http://<your-server-ip>:10000
|
||||
|
||||
- model: openai/gpt-5.2
|
||||
|
|
@ -351,6 +347,35 @@ vLLM provides higher throughput and GPU optimizations suitable for production de
|
|||
curl http://localhost:10000/v1/models
|
||||
|
||||
|
||||
Using vLLM on Kubernetes (GPU nodes)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For teams running Kubernetes, Arch-Router and Plano can be deployed as in-cluster services.
|
||||
The ``demos/llm_routing/model_routing_service/`` directory includes ready-to-use manifests:
|
||||
|
||||
- ``vllm-deployment.yaml`` — Arch-Router served by vLLM, with an init container to download
|
||||
the model from HuggingFace
|
||||
- ``plano-deployment.yaml`` — Plano proxy configured to use the in-cluster Arch-Router
|
||||
- ``config_k8s.yaml`` — Plano config with ``llm_routing_model`` pointing at
|
||||
``http://arch-router:10000`` instead of the default hosted endpoint
|
||||
|
||||
Key things to know before deploying:
|
||||
|
||||
- GPU nodes commonly have a ``nvidia.com/gpu:NoSchedule`` taint — the ``vllm-deployment.yaml``
|
||||
includes a matching toleration. The ``nvidia.com/gpu: "1"`` resource request is sufficient
|
||||
for scheduling in most clusters; a ``nodeSelector`` is optional and commented out in the
|
||||
manifest for cases where you need to pin to a specific GPU node pool.
|
||||
- Model download takes ~1 minute; vLLM loads the model in ~1-2 minutes after that. The
|
||||
``livenessProbe`` has a 180-second ``initialDelaySeconds`` to avoid premature restarts.
|
||||
- The Plano config ConfigMap must be created with ``--from-file=plano_config.yaml=config_k8s.yaml`` and
|
||||
mounted with ``subPath`` in the Deployment; omitting ``subPath`` causes Kubernetes to mount a directory
|
||||
instead of a file.
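The startup timing in the notes above can be checked with a little arithmetic. The sketch below is illustrative only: the durations are the approximate figures quoted in the text, and the helper names (``startup_window``, ``headroom``) are hypothetical, not part of Plano or vLLM.

```python
# Approximate figures taken from the notes above; real clusters will vary.
DOWNLOAD_SECONDS = 60            # "~1 minute" model download
LOAD_SECONDS_RANGE = (60, 120)   # "~1-2 minutes" for vLLM to load the model
LIVENESS_INITIAL_DELAY = 180     # initialDelaySeconds on the livenessProbe


def startup_window(download_seconds, load_range):
    """Return (fastest, slowest) expected time until vLLM is serving."""
    fastest = download_seconds + load_range[0]
    slowest = download_seconds + load_range[1]
    return (fastest, slowest)


def headroom(initial_delay, window):
    """Seconds left between the slowest expected startup and the first probe."""
    return initial_delay - window[1]


window = startup_window(DOWNLOAD_SECONDS, LOAD_SECONDS_RANGE)
print("expected startup window (s):", window)
print("headroom before first liveness probe (s):",
      headroom(LIVENESS_INITIAL_DELAY, window))
```

With these figures the slowest expected startup matches the probe delay exactly, so a slower network or a larger model may call for a higher ``initialDelaySeconds``.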
|
||||
|
||||
For the canonical Plano Kubernetes deployment (ConfigMap, Secrets, Deployment YAML), see
|
||||
:ref:`deployment`. For full step-by-step commands specific to this demo, see the
|
||||
`demo README <https://github.com/katanemo/plano/tree/main/demos/llm_routing/model_routing_service/README.md>`_.
|
||||
|
||||
|
||||
Combining Routing Methods
|
||||
-------------------------
|
||||
|
||||
|
|
|
|||
|
|
@ -335,6 +335,90 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age
|
|||
- id: troubleshoot_agent
|
||||
description: Diagnoses and resolves technical issues step by step
|
||||
|
||||
Self-hosting Plano-Orchestrator
|
||||
-------------------------------
|
||||
|
||||
By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.
|
||||
|
||||
.. note::
|
||||
vLLM requires a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.
|
||||
|
||||
The following model variants are available on HuggingFace:
|
||||
|
||||
* `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_ — lighter model, suitable for development and testing
|
||||
* `Plano-Orchestrator-4B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-4B-FP8>`_ — FP8 quantized 4B model, lower memory usage
|
||||
* `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ — full-size model for production
|
||||
* `Plano-Orchestrator-30B-A3B-FP8 <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B-FP8>`_ — FP8 quantized 30B model, recommended for production deployments
|
||||
|
||||
Using vLLM
|
||||
~~~~~~~~~~
|
||||
|
||||
1. **Install vLLM**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install vllm
|
||||
|
||||
2. **Download the model and chat template**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install huggingface_hub
|
||||
huggingface-cli download katanemo/Plano-Orchestrator-4B
|
||||
|
||||
3. **Start the vLLM server**
|
||||
|
||||
For the 4B model (development):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve katanemo/Plano-Orchestrator-4B \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 1 \
|
||||
--gpu-memory-utilization 0.3 \
|
||||
--tokenizer katanemo/Plano-Orchestrator-4B \
|
||||
--chat-template chat_template.jinja \
|
||||
--served-model-name katanemo/Plano-Orchestrator-4B \
|
||||
--enable-prefix-caching
|
||||
|
||||
For the 30B-A3B-FP8 model (production):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 1 \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
|
||||
--chat-template chat_template.jinja \
|
||||
--max-model-len 32768 \
|
||||
--served-model-name katanemo/Plano-Orchestrator-30B-A3B-FP8 \
|
||||
--enable-prefix-caching
|
||||
|
||||
4. **Configure Plano to use the local orchestrator**
|
||||
|
||||
Use the model name matching your ``--served-model-name``:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
overrides:
|
||||
agent_orchestration_model: plano/katanemo/Plano-Orchestrator-4B
|
||||
|
||||
model_providers:
|
||||
- model: katanemo/Plano-Orchestrator-4B
|
||||
provider_interface: plano
|
||||
base_url: http://<your-server-ip>:8000
|
||||
|
||||
5. **Verify the server is running**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl http://localhost:8000/health
|
||||
curl http://localhost:8000/v1/models
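Beyond checking that the server responds, it can help to confirm that the name passed to ``--served-model-name`` is actually listed by ``/v1/models``, since the Plano config must match it. The following is a minimal sketch using only the Python standard library. It assumes the OpenAI-compatible response shape (a ``data`` list whose entries carry an ``id``); the helper names are hypothetical and the sample response is made up for illustration.

```python
import json
import urllib.request


def served_model_ids(models_response):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def is_model_served(models_response, expected_name):
    """True if the expected --served-model-name appears in the listing."""
    return expected_name in served_model_ids(models_response)


def fetch_models(base_url, timeout=5.0):
    """Fetch and decode /v1/models from a running server (needs network)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)


# Illustrative response only; a live server would be queried with
# fetch_models("http://localhost:8000") instead.
sample_response = {
    "object": "list",
    "data": [{"id": "katanemo/Plano-Orchestrator-4B", "object": "model"}],
}
print(is_model_served(sample_response, "katanemo/Plano-Orchestrator-4B"))
```

If the check fails against a live server, compare the ``model`` value under ``model_providers`` with the ``--served-model-name`` flag used when starting vLLM.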
|
||||
|
||||
|
||||
Next Steps
|
||||
----------
|
||||
|
||||
|
|
|
|||
|
|
@ -65,7 +65,7 @@ Create a ``docker-compose.yml`` file with the following configuration:
|
|||
# docker-compose.yml
|
||||
services:
|
||||
plano:
|
||||
image: katanemo/plano:0.4.11
|
||||
image: katanemo/plano:0.4.12
|
||||
container_name: plano
|
||||
ports:
|
||||
- "10000:10000" # ingress (client -> plano)
|
||||
|
|
@ -153,7 +153,7 @@ Create a ``plano-deployment.yaml``:
|
|||
spec:
|
||||
containers:
|
||||
- name: plano
|
||||
image: katanemo/plano:0.4.11
|
||||
image: katanemo/plano:0.4.12
|
||||
ports:
|
||||
- containerPort: 12000 # LLM gateway (chat completions, model routing)
|
||||
name: llm-gateway
|
||||
|
|
|
|||
|
|
@ -107,11 +107,11 @@ model_providers:
|
|||
- internal: true
|
||||
model: Arch-Function
|
||||
name: arch-function
|
||||
provider_interface: arch
|
||||
provider_interface: plano
|
||||
- internal: true
|
||||
model: Plano-Orchestrator
|
||||
name: plano-orchestrator
|
||||
provider_interface: arch
|
||||
name: plano/orchestrator
|
||||
provider_interface: plano
|
||||
prompt_targets:
|
||||
- description: Get current weather at a location.
|
||||
endpoint:
|
||||
|
|
|
|||