diff --git a/demos/agent_orchestration/travel_agents/README.md b/demos/agent_orchestration/travel_agents/README.md
index 7886539d..9ae46dde 100644
--- a/demos/agent_orchestration/travel_agents/README.md
+++ b/demos/agent_orchestration/travel_agents/README.md
@@ -123,6 +123,42 @@ Each agent:
 Both agents run as native local processes and communicate with Plano running natively on the host.
 
+## Running with a local Plano-Orchestrator (via vLLM)
+
+By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model locally using vLLM on a server with an NVIDIA GPU:
+
+1. Install vLLM:
+```bash
+pip install vllm
+```
+
+2. Start the vLLM server with the 4B model:
+```bash
+vllm serve katanemo/Plano-Orchestrator-4B \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --tensor-parallel-size 1 \
+  --gpu-memory-utilization 0.3 \
+  --tokenizer katanemo/Plano-Orchestrator-4B \
+  --chat-template chat_template.jinja \
+  --served-model-name Plano-Orchestrator \
+  --enable-prefix-caching
+```
+
+3. Start the demo with the local orchestrator config:
+```bash
+./run_demo.sh --local-orchestrator
+```
+
+4. Test with curl:
+```bash
+curl -X POST http://localhost:8001/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-5.2", "messages": [{"role": "user", "content": "What is the weather in Istanbul?"}]}'
+```
+
+You should see Plano use your local orchestrator to route the request to the weather agent.
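+
+The `--local-orchestrator` flag points `run_demo.sh` at `config_local_orchestrator.yaml` instead of `config.yaml`. The relevant difference is the orchestrator provider block, which targets your local vLLM server rather than the hosted endpoint (excerpted below; see `config_local_orchestrator.yaml` for the full file):
+
+```yaml
+orchestration:
+  model: Plano-Orchestrator
+  llm_provider: plano-orchestrator
+
+model_providers:
+  - name: plano-orchestrator
+    model: Plano-Orchestrator
+    base_url: http://localhost:8000
+```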
+
 ## Observability
 
 This demo includes full OpenTelemetry (OTel) compatible distributed tracing to monitor and debug agent interactions:
diff --git a/demos/agent_orchestration/travel_agents/config_local_orchestrator.yaml b/demos/agent_orchestration/travel_agents/config_local_orchestrator.yaml
new file mode 100644
index 00000000..babc9401
--- /dev/null
+++ b/demos/agent_orchestration/travel_agents/config_local_orchestrator.yaml
@@ -0,0 +1,68 @@
+version: v0.3.0
+
+orchestration:
+  model: Plano-Orchestrator
+  llm_provider: plano-orchestrator
+
+agents:
+  - id: weather_agent
+    url: http://localhost:10510
+  - id: flight_agent
+    url: http://localhost:10520
+
+model_providers:
+  - name: plano-orchestrator
+    model: Plano-Orchestrator
+    base_url: http://localhost:8000
+
+  - model: openai/gpt-5.2
+    access_key: $OPENAI_API_KEY
+    default: true
+  - model: openai/gpt-4o-mini
+    access_key: $OPENAI_API_KEY  # smaller, faster, cheaper model for extracting entities like location
+
+listeners:
+  - type: agent
+    name: travel_booking_service
+    port: 8001
+    router: plano_orchestrator_v1
+    agents:
+      - id: weather_agent
+        description: |
+
+          WeatherAgent is a specialized AI assistant for real-time weather information and forecasts. It provides accurate weather data for any city worldwide using the Open-Meteo API, helping travelers plan their trips with up-to-date weather conditions.
+
+          Capabilities:
+          * Get real-time weather conditions and multi-day forecasts for any city worldwide using the Open-Meteo API (free, no API key needed)
+          * Provides current temperature
+          * Provides multi-day forecasts
+          * Provides weather conditions
+          * Provides sunrise/sunset times
+          * Provides detailed weather information
+          * Understands conversation context to resolve location references from previous messages
+          * Handles weather-related questions including "What's the weather in [city]?", "What's the forecast for [city]?", "How's the weather in [city]?"
+          * When queries include both weather and other travel questions (e.g., flights, currency), this agent answers ONLY the weather part
+
+      - id: flight_agent
+        description: |
+
+          FlightAgent is an AI-powered tool specialized in providing live flight information between airports. It leverages the FlightAware AeroAPI to deliver real-time flight status, gate information, and delay updates.
+
+          Capabilities:
+          * Get live flight information between airports using the FlightAware AeroAPI
+          * Shows real-time flight status
+          * Shows scheduled/estimated/actual departure and arrival times
+          * Shows gate and terminal information
+          * Shows delays
+          * Shows aircraft type
+          * Automatically resolves city names to airport codes (IATA/ICAO)
+          * Understands conversation context to infer origin/destination from follow-up questions
+          * Handles flight-related questions including "What flights go from [city] to [city]?", "Do flights go to [city]?", "Are there direct flights from [city]?"
+          * When queries include both flight and other travel questions (e.g., weather, currency), this agent answers ONLY the flight part
+
+tracing:
+  random_sampling: 100
+  span_attributes:
+    header_prefixes:
+      - x-acme-
diff --git a/demos/agent_orchestration/travel_agents/run_demo.sh b/demos/agent_orchestration/travel_agents/run_demo.sh
index 643a0aa2..35166b85 100755
--- a/demos/agent_orchestration/travel_agents/run_demo.sh
+++ b/demos/agent_orchestration/travel_agents/run_demo.sh
@@ -31,8 +31,13 @@ start_demo() {
     fi
 
     # Step 4: Start Plano
-    echo "Starting Plano with config.yaml..."
-    planoai up config.yaml
+    PLANO_CONFIG="config.yaml"
+    if [ "$1" == "--local-orchestrator" ]; then
+        PLANO_CONFIG="config_local_orchestrator.yaml"
+        echo "Using local orchestrator config..."
+    fi
+    echo "Starting Plano with $PLANO_CONFIG..."
+    planoai up "$PLANO_CONFIG"
 
     # Step 5: Start agents natively
     echo "Starting agents..."
diff --git a/docs/source/guides/orchestration.rst b/docs/source/guides/orchestration.rst
index 3170b65f..20b5455a 100644
--- a/docs/source/guides/orchestration.rst
+++ b/docs/source/guides/orchestration.rst
@@ -335,6 +335,87 @@ Combine RAG agents for documentation lookup with specialized troubleshooting age
       - id: troubleshoot_agent
         description: Diagnoses and resolves technical issues step by step
 
+Self-hosting Plano-Orchestrator
+-------------------------------
+
+By default, Plano uses a hosted Plano-Orchestrator endpoint. To self-host the orchestrator model, you can serve it using **vLLM** on a server with an NVIDIA GPU.
+
+.. note::
+
+   vLLM requires a Linux server with an NVIDIA GPU (CUDA). For local development on macOS, a GGUF version for Ollama is coming soon.
+
+Two model variants are available on Hugging Face:
+
+* `Plano-Orchestrator-4B <https://huggingface.co/katanemo/Plano-Orchestrator-4B>`_ — lighter model, suitable for development and testing
+* `Plano-Orchestrator-30B-A3B <https://huggingface.co/katanemo/Plano-Orchestrator-30B-A3B>`_ — full-size model for production (FP8 quantized variant also available)
+
+Using vLLM
+~~~~~~~~~~
+
+1. **Install vLLM**
+
+   .. code-block:: bash
+
+      pip install vllm
+
+2. **Download the model and chat template**
+
+   .. code-block:: bash
+
+      pip install huggingface_hub
+      huggingface-cli download katanemo/Plano-Orchestrator-4B
+
+3. **Start the vLLM server**
+
+   For the 4B model (development):
+
+   .. code-block:: bash
+
+      vllm serve katanemo/Plano-Orchestrator-4B \
+        --host 0.0.0.0 \
+        --port 8000 \
+        --tensor-parallel-size 1 \
+        --gpu-memory-utilization 0.3 \
+        --tokenizer katanemo/Plano-Orchestrator-4B \
+        --chat-template chat_template.jinja \
+        --served-model-name Plano-Orchestrator \
+        --enable-prefix-caching
+
+   For the 30B-A3B-FP8 model (production):
+
+   .. code-block:: bash
+
+      vllm serve katanemo/Plano-Orchestrator-30B-A3B-FP8 \
+        --host 0.0.0.0 \
+        --port 8000 \
+        --tensor-parallel-size 1 \
+        --gpu-memory-utilization 0.9 \
+        --tokenizer katanemo/Plano-Orchestrator-30B-A3B-FP8 \
+        --chat-template chat_template.jinja \
+        --max-model-len 32768 \
+        --served-model-name Plano-Orchestrator \
+        --enable-prefix-caching
+
+4. **Configure Plano to use the local orchestrator**
+
+   .. code-block:: yaml
+
+      orchestration:
+        model: Plano-Orchestrator
+        llm_provider: plano-orchestrator
+
+      model_providers:
+        - name: plano-orchestrator
+          model: Plano-Orchestrator
+          base_url: http://<vllm-server-ip>:8000
+
+5. **Verify the server is running**
+
+   .. code-block:: bash
+
+      curl http://localhost:8000/health
+      curl http://localhost:8000/v1/models
+
 Next Steps
 ----------