diff --git a/demos/llm_routing/preference_based_routing/README.md b/demos/llm_routing/preference_based_routing/README.md
index 9d71971c..009002fd 100644
--- a/demos/llm_routing/preference_based_routing/README.md
+++ b/demos/llm_routing/preference_based_routing/README.md
@@ -32,6 +32,37 @@ planoai up config.yaml
 3. Test with curl or open AnythingLLM http://localhost:3001/
 
+## Running with local Arch-Router (via Ollama)
+
+By default, Plano uses a hosted Arch-Router endpoint. To self-host Arch-Router locally using Ollama:
+
+1. Install [Ollama](https://ollama.ai) and make sure it is running (`ollama serve` or the macOS app).
+
+2. Pull the model:
+```bash
+ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+```
+
+3. Start Plano with the local config:
+```bash
+planoai up plano_config_local.yaml
+```
+
+4. Test routing:
+```bash
+curl -s "http://localhost:12000/routing/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gpt-4o-mini",
+    "max_tokens": 1024,
+    "messages": [
+      {"role": "user", "content": "Create a REST API endpoint in Rust using actix-web"}
+    ]
+  }'
+```
+
+You should see the router select the appropriate model based on the routing preferences defined in `plano_config_local.yaml`.
+
 # Testing out preference based routing
 
 We have defined two routes 1. code generation and 2. code understanding
diff --git a/docs/source/guides/llm_router.rst b/docs/source/guides/llm_router.rst
index 188b1e30..41c51b4a 100644
--- a/docs/source/guides/llm_router.rst
+++ b/docs/source/guides/llm_router.rst
@@ -228,6 +228,129 @@ In summary, Arch-Router demonstrates:
 - **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
 
+Self-hosting Arch-Router
+------------------------
+
+By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either **Ollama** or **vLLM**.
+
+Using Ollama (recommended for local development)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. **Install Ollama**
+
+   Download and install from `ollama.ai <https://ollama.ai>`_.
+
+2. **Pull and serve Arch-Router**
+
+   .. code-block:: bash
+
+      ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+      ollama serve
+
+   This downloads the quantized GGUF model from HuggingFace and serves it on ``http://localhost:11434``. If the Ollama desktop app is already running, you can skip ``ollama serve``.
+
+3. **Configure Plano to use local Arch-Router**
+
+   .. code-block:: yaml
+
+      routing:
+        model: Arch-Router
+        llm_provider: arch-router
+
+      model_providers:
+        - name: arch-router
+          model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+          base_url: http://localhost:11434
+
+        - model: openai/gpt-5.2
+          access_key: $OPENAI_API_KEY
+          default: true
+
+        - model: anthropic/claude-sonnet-4-5
+          access_key: $ANTHROPIC_API_KEY
+          routing_preferences:
+            - name: creative writing
+              description: creative content generation, storytelling, and writing assistance
+
+4. **Verify the model is running**
+
+   .. code-block:: bash
+
+      curl http://localhost:11434/v1/models
+
+   You should see the pulled Arch-Router model listed in the response.
+
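+   As an optional smoke test, you can also send a chat completion directly to
+   the local model through Ollama's OpenAI-compatible endpoint (assuming
+   Ollama's default port). This only confirms that the model loads and
+   responds; the actual routing prompt is constructed by Plano, so the reply
+   here is not a routing decision:
+
+   .. code-block:: bash
+
+      # Direct query to the local model; Ollama names HuggingFace pulls
+      # by their full pull reference.
+      curl http://localhost:11434/v1/chat/completions \
+        -H "Content-Type: application/json" \
+        -d '{
+          "model": "hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M",
+          "messages": [{"role": "user", "content": "Say OK if you are up."}]
+        }'
+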
+Using vLLM (recommended for production / EC2)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+vLLM provides higher throughput and GPU optimizations suitable for production deployments.
+
+1. **Install vLLM**
+
+   .. code-block:: bash
+
+      pip install vllm
+
+2. **Download the model weights**
+
+   The GGUF weights are downloaded automatically from HuggingFace on first use. To pre-download:
+
+   .. code-block:: bash
+
+      pip install huggingface_hub
+      huggingface-cli download katanemo/Arch-Router-1.5B.gguf
+
+3. **Start the vLLM server**
+
+   After downloading, find the GGUF file and Jinja template in the HuggingFace cache:
+
+   .. code-block:: bash
+
+      # Find the downloaded files
+      SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -1)
+
+      vllm serve ${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf \
+        --host 0.0.0.0 \
+        --port 10000 \
+        --load-format gguf \
+        --chat-template ${SNAPSHOT_DIR}template.jinja \
+        --tokenizer katanemo/Arch-Router-1.5B \
+        --served-model-name Arch-Router \
+        --gpu-memory-utilization 0.3 \
+        --tensor-parallel-size 1 \
+        --enable-prefix-caching
+
+4. **Configure Plano to use the vLLM endpoint**
+
+   .. code-block:: yaml
+
+      routing:
+        model: Arch-Router
+        llm_provider: arch-router
+
+      model_providers:
+        - name: arch-router
+          model: Arch-Router
+          # Replace <vllm-host> with the address of your vLLM server
+          base_url: http://<vllm-host>:10000
+
+        - model: openai/gpt-5.2
+          access_key: $OPENAI_API_KEY
+          default: true
+
+        - model: anthropic/claude-sonnet-4-5
+          access_key: $ANTHROPIC_API_KEY
+          routing_preferences:
+            - name: creative writing
+              description: creative content generation, storytelling, and writing assistance
+
+5. **Verify the server is running**
+
+   .. code-block:: bash
+
+      curl http://localhost:10000/health
+      curl http://localhost:10000/v1/models
+
+
 Combining Routing Methods
 -------------------------