add instructions on hosting arch-router locally (#819)

2026-07-23 16:51:04 +02:00 · 2026-03-11 15:28:50 -07:00 · 2026-03-11 15:28:50 -07:00 · 5400b0a2fa
commit 5400b0a2fa
parent b4313d93a4
2 changed files with 154 additions and 0 deletions
--- a/demos/llm_routing/preference_based_routing/README.md
+++ b/demos/llm_routing/preference_based_routing/README.md
@ -32,6 +32,37 @@ planoai up config.yaml

 3. Test with curl or open AnythingLLM http://localhost:3001/

+## Running with local Arch-Router (via Ollama)
+
+By default, Plano uses a hosted Arch-Router endpoint. To self-host Arch-Router locally using Ollama:
+
+1. Install [Ollama](https://ollama.ai) and pull the model:
+```bash
+ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+```
+
+2. Make sure Ollama is running (`ollama serve` or the macOS app).
+
+3. Start Plano with the local config:
+```bash
+planoai up plano_config_local.yaml
+```
+
+4. Test routing:
+```bash
+curl -s "http://localhost:12000/routing/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gpt-4o-mini",
+    "max_tokens": 1024,
+    "messages": [
+      {"role": "user", "content": "Create a REST API endpoint in Rust using actix-web"}
+    ]
+  }'
+```
+
+You should see the router select the appropriate model based on the routing preferences defined in `plano_config_local.yaml`.
+
 # Testing out preference based routing

 We have defined two routes 1. code generation and 2. code understanding
--- a/docs/source/guides/llm_router.rst
+++ b/docs/source/guides/llm_router.rst
@ -228,6 +228,129 @@ In summary, Arch-Router demonstrates:
 - **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.


+Self-hosting Arch-Router
+------------------------
+
+By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either **Ollama** or **vLLM**.
+
+Using Ollama (recommended for local development)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. **Install Ollama**
+
+   Download and install from `ollama.ai <https://ollama.ai>`_.
+
+2. **Pull and serve Arch-Router**
+
+   .. code-block:: bash
+
+       ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+       ollama serve
+
+   This downloads the quantized GGUF model from HuggingFace and starts serving on ``http://localhost:11434``.
+
+3. **Configure Plano to use local Arch-Router**
+
+   .. code-block:: yaml
+
+       routing:
+         model: Arch-Router
+         llm_provider: arch-router
+
+       model_providers:
+         - name: arch-router
+           model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+           base_url: http://localhost:11434
+
+         - model: openai/gpt-5.2
+           access_key: $OPENAI_API_KEY
+           default: true
+
+         - model: anthropic/claude-sonnet-4-5
+           access_key: $ANTHROPIC_API_KEY
+           routing_preferences:
+             - name: creative writing
+               description: creative content generation, storytelling, and writing assistance
+
+4. **Verify the model is running**
+
+   .. code-block:: bash
+
+       curl http://localhost:11434/v1/models
+
+   You should see ``Arch-Router-1.5B`` listed in the response.
+
+Using vLLM (recommended for production / EC2)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+vLLM provides higher throughput and GPU optimizations suitable for production deployments.
+
+1. **Install vLLM**
+
+   .. code-block:: bash
+
+       pip install vllm
+
+2. **Download the model weights**
+
+   The GGUF weights are downloaded automatically from HuggingFace on first use. To pre-download:
+
+   .. code-block:: bash
+
+       pip install huggingface_hub
+       huggingface-cli download katanemo/Arch-Router-1.5B.gguf
+
+3. **Start the vLLM server**
+
+   After downloading, find the GGUF file and Jinja template in the HuggingFace cache:
+
+   .. code-block:: bash
+
+       # Find the downloaded files
+       SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -1)
+
+       vllm serve ${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf \
+           --host 0.0.0.0 \
+           --port 10000 \
+           --load-format gguf \
+           --chat-template ${SNAPSHOT_DIR}template.jinja \
+           --tokenizer katanemo/Arch-Router-1.5B \
+           --served-model-name Arch-Router \
+           --gpu-memory-utilization 0.3 \
+           --tensor-parallel-size 1 \
+           --enable-prefix-caching
+
+4. **Configure Plano to use the vLLM endpoint**
+
+   .. code-block:: yaml
+
+       routing:
+         model: Arch-Router
+         llm_provider: arch-router
+
+       model_providers:
+         - name: arch-router
+           model: Arch-Router
+           base_url: http://<your-server-ip>:10000
+
+         - model: openai/gpt-5.2
+           access_key: $OPENAI_API_KEY
+           default: true
+
+         - model: anthropic/claude-sonnet-4-5
+           access_key: $ANTHROPIC_API_KEY
+           routing_preferences:
+             - name: creative writing
+               description: creative content generation, storytelling, and writing assistance
+
+5. **Verify the server is running**
+
+   .. code-block:: bash
+
+       curl http://localhost:10000/health
+       curl http://localhost:10000/v1/models
+
+
 Combining Routing Methods
 -------------------------