mirror of
https://github.com/katanemo/plano.git
synced 2026-07-02 15:51:02 +02:00
add self-hosting docs and demo for Plano-Orchestrator
parent 8edf686665
commit 747946fb39
4 changed files with 192 additions and 2 deletions
@ -123,6 +123,42 @@ Each agent:

Both agents run as native local processes and communicate with Plano running natively on the host.

## Running with local Plano-Orchestrator (via vLLM)

By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model locally using vLLM on a server with an NVIDIA GPU:

1. Install vLLM (the model weights are fetched from Hugging Face the first time `vllm serve` runs):
```bash
pip install vllm
```

2. Start the vLLM server with the 4B model:
```bash
vllm serve katanemo/Plano-Orchestrator-4B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.3 \
  --tokenizer katanemo/Plano-Orchestrator-4B \
  --chat-template chat_template.jinja \
  --served-model-name Plano-Orchestrator \
  --enable-prefix-caching
```

3. Start the demo with the local orchestrator config:
```bash
./run_demo.sh --local-orchestrator
```

4. Test with curl:
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.2", "messages": [{"role": "user", "content": "What is the weather in Istanbul?"}]}'
```

You should see Plano use your local orchestrator to route the request to the weather agent.
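
Loading the model can take a while, so it may help to wait for the vLLM server before starting the demo in step 3. The sketch below is one possible way to do that. `wait_until` is a hypothetical helper, not part of the demo scripts; it simply re-runs whatever command it is given.

```bash
#!/usr/bin/env bash
# wait_until RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND until it succeeds or RETRIES attempts are used up.
wait_until() {
  local retries="$1" delay="$2"
  shift 2
  local attempt
  for ((attempt = 1; attempt <= retries; attempt++)); do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((attempt < retries)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

Assuming the vLLM server exposes `/health` on port 8000 as configured above, a call could look like `wait_until 60 5 curl -fsS http://localhost:8000/health && ./run_demo.sh --local-orchestrator`. The retry count and delay are illustrative values, not recommendations.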

## Observability

This demo includes full OpenTelemetry (OTel) compatible distributed tracing to monitor and debug agent interactions:

@ -0,0 +1,68 @@
version: v0.3.0

orchestration:
  model: Plano-Orchestrator
  llm_provider: plano-orchestrator

agents:
  - id: weather_agent
    url: http://localhost:10510
  - id: flight_agent
    url: http://localhost:10520

model_providers:
  - name: plano-orchestrator
    model: Plano-Orchestrator
    base_url: http://localhost:8000

  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY # smaller, faster, cheaper model for extracting entities like location

listeners:
  - type: agent
    name: travel_booking_service
    port: 8001
    router: plano_orchestrator_v1
    agents:
      - id: weather_agent
        description: |

          WeatherAgent is a specialized AI assistant for real-time weather information and forecasts. It provides accurate weather data for any city worldwide using the Open-Meteo API, helping travelers plan their trips with up-to-date weather conditions.

          Capabilities:
            * Get real-time weather conditions and multi-day forecasts for any city worldwide using Open-Meteo API (free, no API key needed)
            * Provides current temperature
            * Provides multi-day forecasts
            * Provides weather conditions
            * Provides sunrise/sunset times
            * Provides detailed weather information
            * Understands conversation context to resolve location references from previous messages
            * Handles weather-related questions including "What's the weather in [city]?", "What's the forecast for [city]?", "How's the weather in [city]?"
            * When queries include both weather and other travel questions (e.g., flights, currency), this agent answers ONLY the weather part

      - id: flight_agent
        description: |

          FlightAgent is an AI-powered tool specialized in providing live flight information between airports. It leverages the FlightAware AeroAPI to deliver real-time flight status, gate information, and delay updates.

          Capabilities:
            * Get live flight information between airports using FlightAware AeroAPI
            * Shows real-time flight status
            * Shows scheduled/estimated/actual departure and arrival times
            * Shows gate and terminal information
            * Shows delays
            * Shows aircraft type
            * Shows flight status
            * Automatically resolves city names to airport codes (IATA/ICAO)
            * Understands conversation context to infer origin/destination from follow-up questions
            * Handles flight-related questions including "What flights go from [city] to [city]?", "Do flights go to [city]?", "Are there direct flights from [city]?"
            * When queries include both flight and other travel questions (e.g., weather, currency), this agent answers ONLY the flight part

tracing:
  random_sampling: 100
  span_attributes:
    header_prefixes:
      - x-acme-

@ -31,8 +31,13 @@ start_demo() {
     fi

     # Step 4: Start Plano
-    echo "Starting Plano with config.yaml..."
-    planoai up config.yaml
+    PLANO_CONFIG="config.yaml"
+    if [ "$1" == "--local-orchestrator" ]; then
+        PLANO_CONFIG="config_local_orchestrator.yaml"
+        echo "Using local orchestrator config..."
+    fi
+    echo "Starting Plano with $PLANO_CONFIG..."
+    planoai up "$PLANO_CONFIG"

     # Step 5: Start agents natively
     echo "Starting agents..."

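The flag handling in this hunk could also be factored into a small function, which makes the config selection easy to check without starting Plano. This is a minimal sketch, assuming the two config file names used in this commit. `select_plano_config` is a hypothetical name, and rejecting unknown flags is an added behavior that the demo script itself does not have.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the Plano config file to use for a given flag.
select_plano_config() {
  case "${1:-}" in
    "")
      # No flag: use the default config.
      echo "config.yaml"
      ;;
    --local-orchestrator)
      # Self-hosted orchestrator served by vLLM.
      echo "config_local_orchestrator.yaml"
      ;;
    *)
      echo "unknown option: $1" >&2
      return 2
      ;;
  esac
}
```

A caller could then write `PLANO_CONFIG="$(select_plano_config "${1:-}")" && planoai up "$PLANO_CONFIG"`, so an unrecognized flag stops the script instead of silently falling back to the default config.
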
@ -335,6 +335,87 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age
      - id: troubleshoot_agent
        description: Diagnoses and resolves technical issues step by step

Self-hosting Plano-Orchestrator
-------------------------------

By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.

.. note::
   This setup assumes a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.

Two model variants are available on Hugging Face:

* `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_: lighter model, suitable for development and testing
* `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_: full-size model for production (an FP8-quantized variant is also available)

Using vLLM
~~~~~~~~~~

1. **Install vLLM**

   .. code-block:: bash

      pip install vllm

2. **Download the model and chat template**

   .. code-block:: bash

      pip install huggingface_hub
      huggingface-cli download katanemo/Plano-Orchestrator-4B

3. **Start the vLLM server**

   For the 4B model (development):

   .. code-block:: bash

      vllm serve katanemo/Plano-Orchestrator-4B \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.3 \
        --tokenizer katanemo/Plano-Orchestrator-4B \
        --chat-template chat_template.jinja \
        --served-model-name Plano-Orchestrator \
        --enable-prefix-caching

   For the 30B-A3B-FP8 model (production):

   .. code-block:: bash

      vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.9 \
        --tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
        --chat-template chat_template.jinja \
        --max-model-len 32768 \
        --served-model-name Plano-Orchestrator \
        --enable-prefix-caching

4. **Configure Plano to use the local orchestrator**

   .. code-block:: yaml

      orchestration:
        model: Plano-Orchestrator
        llm_provider: plano-orchestrator

      model_providers:
        - name: plano-orchestrator
          model: Plano-Orchestrator
          base_url: http://<your-server-ip>:8000

5. **Verify the server is running**

   .. code-block:: bash

      curl http://localhost:8000/health
      curl http://localhost:8000/v1/models
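
The Plano config refers to the model as ``Plano-Orchestrator``, so it is worth confirming that the name reported by ``/v1/models`` matches ``--served-model-name``. The following is a rough sketch, assuming the endpoint returns an OpenAI-style model list with an ``id`` field per model. ``has_model`` is a hypothetical helper; it does a plain text match rather than real JSON parsing and assumes model names without regex metacharacters.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the JSON on stdin lists a model with the given id.
has_model() {
  local name="$1"
  # Allow optional whitespace around the colon, since JSON formatting can vary.
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"${name}\""
}
```

For example, ``curl -fsS http://localhost:8000/v1/models | has_model Plano-Orchestrator && echo "model is served"`` prints the message only when the served name matches.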

Next Steps
----------
