add self-hosting docs and demo for Plano-Orchestrator

2026-07-02 15:51:02 +02:00 · 2026-03-11 15:46:47 -07:00 · 747946fb39
commit 747946fb39
parent 8edf686665
4 changed files with 192 additions and 2 deletions
--- a/demos/agent_orchestration/travel_agents/README.md
+++ b/demos/agent_orchestration/travel_agents/README.md
@@ -123,6 +123,42 @@ Each agent:
 Both agents run as native local processes and communicate with Plano running natively on the host.
 ## Running with local Plano-Orchestrator (via vLLM)
 By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model locally using vLLM on a server with an NVIDIA GPU:
 1. Install vLLM (`vllm serve` downloads the model from Hugging Face on first run):
 ```bash
 pip install vllm
 ```
 2. Start the vLLM server with the 4B model:
 ```bash
 vllm serve katanemo/Plano-Orchestrator-4B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.3 \
    --tokenizer katanemo/Plano-Orchestrator-4B \
    --chat-template chat_template.jinja \
    --served-model-name Plano-Orchestrator \
    --enable-prefix-caching
 ```
 3. Start the demo with the local orchestrator config:
 ```bash
 ./run_demo.sh --local-orchestrator
 ```
 4. Test with curl:
 ```bash
 curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.2", "messages": [{"role": "user", "content": "What is the weather in Istanbul?"}]}'
 ```
 You should see Plano use your local orchestrator to route the request to the weather agent.
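vLLM can take a while to load the model, so step 3 may start before the server from step 2 accepts requests. The following is a minimal sketch of a retry loop, not part of the demo scripts: `wait_until` is a hypothetical helper name, and the usage line assumes the `/health` endpoint of the vLLM server on port 8000 as configured in step 2.

```bash
# Hypothetical helper (not shipped with the demo): run a command repeatedly
# until it succeeds, giving up after a fixed number of attempts.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts
  local delay
  local i
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Success: stop retrying and report it to the caller.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With this in place, something like `wait_until 60 5 curl -sf http://localhost:8000/health && ./run_demo.sh --local-orchestrator` would start the demo only once the health check passes; the attempt count and delay are arbitrary and should be tuned to the model's load time.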
 ## Observability
 This demo includes full OpenTelemetry (OTel) compatible distributed tracing to monitor and debug agent interactions:
--- a/demos/agent_orchestration/travel_agents/config_local_orchestrator.yaml
+++ b/demos/agent_orchestration/travel_agents/config_local_orchestrator.yaml
@@ -0,0 +1,68 @@
 version: v0.3.0
 orchestration:
  model: Plano-Orchestrator
  llm_provider: plano-orchestrator
 agents:
  - id: weather_agent
    url: http://localhost:10510
  - id: flight_agent
    url: http://localhost:10520
 model_providers:
  - name: plano-orchestrator
    model: Plano-Orchestrator
    base_url: http://localhost:8000
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY # smaller, faster, cheaper model for extracting entities like location
 listeners:
  - type: agent
    name: travel_booking_service
    port: 8001
    router: plano_orchestrator_v1
    agents:
      - id: weather_agent
        description: |
          WeatherAgent is a specialized AI assistant for real-time weather information and forecasts. It provides accurate weather data for any city worldwide using the Open-Meteo API, helping travelers plan their trips with up-to-date weather conditions.
          Capabilities:
            * Get real-time weather conditions and multi-day forecasts for any city worldwide using Open-Meteo API (free, no API key needed)
            * Provides current temperature
            * Provides multi-day forecasts
            * Provides weather conditions
            * Provides sunrise/sunset times
            * Provides detailed weather information
            * Understands conversation context to resolve location references from previous messages
            * Handles weather-related questions including "What's the weather in [city]?", "What's the forecast for [city]?", "How's the weather in [city]?"
            * When queries include both weather and other travel questions (e.g., flights, currency), this agent answers ONLY the weather part
      - id: flight_agent
        description: |
          FlightAgent is an AI-powered tool specialized in providing live flight information between airports. It leverages the FlightAware AeroAPI to deliver real-time flight status, gate information, and delay updates.
          Capabilities:
            * Get live flight information between airports using FlightAware AeroAPI
            * Shows real-time flight status
            * Shows scheduled/estimated/actual departure and arrival times
            * Shows gate and terminal information
            * Shows delays
            * Shows aircraft type
            * Shows flight status
            * Automatically resolves city names to airport codes (IATA/ICAO)
            * Understands conversation context to infer origin/destination from follow-up questions
            * Handles flight-related questions including "What flights go from [city] to [city]?", "Do flights go to [city]?", "Are there direct flights from [city]?"
            * When queries include both flight and other travel questions (e.g., weather, currency), this agent answers ONLY the flight part
 tracing:
  random_sampling: 100
  span_attributes:
    header_prefixes:
      - x-acme-
--- a/demos/agent_orchestration/travel_agents/run_demo.sh
+++ b/demos/agent_orchestration/travel_agents/run_demo.sh
@@ -31,8 +31,13 @@ start_demo() {
  fi
  # Step 4: Start Plano
-  echo "Starting Plano with config.yaml..."
-  planoai up config.yaml
+  PLANO_CONFIG="config.yaml"
+  if [ "$1" == "--local-orchestrator" ]; then
+    PLANO_CONFIG="config_local_orchestrator.yaml"
+    echo "Using local orchestrator config..."
+  fi
+  echo "Starting Plano with $PLANO_CONFIG..."
+  planoai up "$PLANO_CONFIG"
  # Step 5: Start agents natively
  echo "Starting agents..."
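Inside `start_demo()`, `$1` is the function's first argument rather than the script's, so the flag only takes effect if the caller forwards it (for example with `start_demo "$@"`; the call site is not shown in this hunk). The following is a minimal sketch of an order-independent alternative, assuming a hypothetical helper named `select_plano_config` that is not part of this commit:

```bash
# Hypothetical helper (not in run_demo.sh): choose the Plano config file
# based on whether --local-orchestrator appears anywhere in the arguments.
select_plano_config() {
  local config
  local arg
  config="config.yaml"
  for arg in "$@"; do
    if [ "$arg" = "--local-orchestrator" ]; then
      config="config_local_orchestrator.yaml"
    fi
  done
  printf '%s\n' "$config"
}
```

The script could then set `PLANO_CONFIG="$(select_plano_config "$@")"` before calling `planoai up "$PLANO_CONFIG"`, which keeps the default of `config.yaml` when the flag is absent.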
--- a/docs/source/guides/orchestration.rst
+++ b/docs/source/guides/orchestration.rst
@@ -335,6 +335,87 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age
      - id: troubleshoot_agent
        description: Diagnoses and resolves technical issues step by step
 Self-hosting Plano-Orchestrator
 -------------------------------
 By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.
 .. note::
   vLLM requires a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.
 Two model variants are available on Hugging Face:
 * `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_ — lighter model, suitable for development and testing
 * `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ — full-size model for production (FP8 quantized variant also available)
 Using vLLM
 ~~~~~~~~~~
 1. **Install vLLM**
   .. code-block:: bash
       pip install vllm
 2. **Download the model and chat template**
   .. code-block:: bash
       pip install huggingface_hub
       huggingface-cli download katanemo/Plano-Orchestrator-4B
 3. **Start the vLLM server**
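   For the 4B model (development). ``--chat-template`` takes a file path, so run the command from the directory that contains ``chat_template.jinja``: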
   .. code-block:: bash
       vllm serve katanemo/Plano-Orchestrator-4B \
           --host 0.0.0.0 \
           --port 8000 \
           --tensor-parallel-size 1 \
           --gpu-memory-utilization 0.3 \
           --tokenizer katanemo/Plano-Orchestrator-4B \
           --chat-template chat_template.jinja \
           --served-model-name Plano-Orchestrator \
           --enable-prefix-caching
   For the 30B-A3B-FP8 model (production):
   .. code-block:: bash
       vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
           --host 0.0.0.0 \
           --port 8000 \
           --tensor-parallel-size 1 \
           --gpu-memory-utilization 0.9 \
           --tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
           --chat-template chat_template.jinja \
           --max-model-len 32768 \
           --served-model-name Plano-Orchestrator \
           --enable-prefix-caching
 4. **Configure Plano to use the local orchestrator**
   .. code-block:: yaml
       orchestration:
         model: Plano-Orchestrator
         llm_provider: plano-orchestrator
       model_providers:
         - name: plano-orchestrator
           model: Plano-Orchestrator
           base_url: http://<your-server-ip>:8000
 5. **Verify the server is running**
   .. code-block:: bash
       curl http://localhost:8000/health
       curl http://localhost:8000/v1/models
 Next Steps
 ----------