add instructions on hosting arch-router locally (#819)

Adil Hafeez 2026-03-11 15:28:50 -07:00 committed by GitHub
parent b4313d93a4
commit 5400b0a2fa
2 changed files with 154 additions and 0 deletions


@@ -32,6 +32,37 @@ planoai up config.yaml
3. Test with curl or open AnythingLLM http://localhost:3001/
## Running with local Arch-Router (via Ollama)
By default, Plano uses a hosted Arch-Router endpoint. To self-host Arch-Router locally using Ollama:
1. Install [Ollama](https://ollama.ai) and pull the model:
```bash
ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
2. Make sure Ollama is running (`ollama serve` or the macOS app).
3. Start Plano with the local config:
```bash
planoai up plano_config_local.yaml
```
4. Test routing:
```bash
curl -s "http://localhost:12000/routing/v1/messages" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Create a REST API endpoint in Rust using actix-web"}
    ]
  }'
```
You should see the router select the appropriate model based on the routing preferences defined in `plano_config_local.yaml`.
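The same request can also be sent from Python. The sketch below is a minimal illustration using only the standard library. The helper names `build_routing_request` and `send` are hypothetical (they are not part of Plano), and the endpoint, model, and payload are copied from the curl command above. The response is returned as decoded JSON without interpretation, since its exact shape is not described here.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
ROUTING_URL = "http://localhost:12000/routing/v1/messages"


def build_routing_request(prompt, model="gpt-4o-mini", max_tokens=1024, url=ROUTING_URL):
    """Build (but do not send) the POST request used in the curl example."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response (needs Plano running)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With Plano and Ollama both running, `send(build_routing_request("Create a REST API endpoint in Rust using actix-web"))` issues the same call as the curl command.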
# Testing out preference based routing
We have defined two routes 1. code generation and 2. code understanding


@@ -228,6 +228,129 @@ In summary, Arch-Router demonstrates:
- **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
Self-hosting Arch-Router
------------------------
By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either **Ollama** or **vLLM**.
Using Ollama (recommended for local development)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. **Install Ollama**
Download and install from `ollama.ai <https://ollama.ai>`_.
2. **Pull and serve Arch-Router**
.. code-block:: bash

   ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
   ollama serve

This downloads the quantized GGUF model from Hugging Face and starts the Ollama server on ``http://localhost:11434``.
3. **Configure Plano to use local Arch-Router**
.. code-block:: yaml

   routing:
     model: Arch-Router
     llm_provider: arch-router

   model_providers:
     - name: arch-router
       model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
       base_url: http://localhost:11434
     - model: openai/gpt-5.2
       access_key: $OPENAI_API_KEY
       default: true
     - model: anthropic/claude-sonnet-4-5
       access_key: $ANTHROPIC_API_KEY
       routing_preferences:
         - name: creative writing
           description: creative content generation, storytelling, and writing assistance

4. **Verify the model is running**
.. code-block:: bash

   curl http://localhost:11434/v1/models

You should see an entry for the pulled model (its ID contains ``Arch-Router-1.5B``) in the response.
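
The check in step 4 can also be scripted. The following is a minimal sketch, assuming the endpoint returns an OpenAI-style model list (a JSON object with a ``data`` array whose entries carry an ``id``). The helper names ``router_model_ids`` and ``fetch_models`` are hypothetical, and the case-insensitive substring match on ``arch-router`` is an illustrative choice rather than anything defined by Plano or Ollama.

.. code-block:: python

   import json
   import urllib.request


   def router_model_ids(models_payload, needle="arch-router"):
       """Return the ids in an OpenAI-style model list that contain `needle`."""
       ids = [entry.get("id", "") for entry in models_payload.get("data", [])]
       return [model_id for model_id in ids if needle.lower() in model_id.lower()]


   def fetch_models(base_url="http://localhost:11434", timeout=10):
       """GET <base_url>/v1/models and return the decoded JSON body."""
       with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
           return json.loads(resp.read().decode("utf-8"))

With Ollama running, ``router_model_ids(fetch_models())`` returns a non-empty list once the model has been pulled.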
Using vLLM (recommended for production / EC2)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vLLM provides higher throughput and GPU optimizations suitable for production deployments.
1. **Install vLLM**
.. code-block:: bash

   pip install vllm

2. **Download the model weights**
Step 3 loads the GGUF file from the local Hugging Face cache, so download the weights first:
.. code-block:: bash

   pip install huggingface_hub
   huggingface-cli download katanemo/Arch-Router-1.5B.gguf

3. **Start the vLLM server**
After downloading, locate the GGUF file and Jinja template in the Hugging Face cache, then start the server:
.. code-block:: bash

   # Locate the snapshot directory that holds the downloaded files
   # (if several snapshots exist, this picks the first one listed)
   SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -1)

   # Quote the paths so a home directory containing spaces does not break the command
   vllm serve "${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf" \
       --host 0.0.0.0 \
       --port 10000 \
       --load-format gguf \
       --chat-template "${SNAPSHOT_DIR}template.jinja" \
       --tokenizer katanemo/Arch-Router-1.5B \
       --served-model-name Arch-Router \
       --gpu-memory-utilization 0.3 \
       --tensor-parallel-size 1 \
       --enable-prefix-caching

4. **Configure Plano to use the vLLM endpoint**
.. code-block:: yaml

   routing:
     model: Arch-Router
     llm_provider: arch-router

   model_providers:
     - name: arch-router
       model: Arch-Router
       base_url: http://<your-server-ip>:10000
     - model: openai/gpt-5.2
       access_key: $OPENAI_API_KEY
       default: true
     - model: anthropic/claude-sonnet-4-5
       access_key: $ANTHROPIC_API_KEY
       routing_preferences:
         - name: creative writing
           description: creative content generation, storytelling, and writing assistance

5. **Verify the server is running**
.. code-block:: bash

   curl http://localhost:10000/health
   curl http://localhost:10000/v1/models

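
Because the server may need some time to load the weights before it answers, it can help to poll the health endpoint rather than check once. The following is a minimal sketch using only the Python standard library. The helper names ``http_ok`` and ``wait_until_ready``, the attempt count, and the delay are all hypothetical choices for illustration; only the ``/health`` URL comes from the step above.

.. code-block:: python

   import time
   import urllib.error
   import urllib.request


   def http_ok(url, timeout=5):
       """Return True if a GET to `url` answers with HTTP 200."""
       try:
           with urllib.request.urlopen(url, timeout=timeout) as resp:
               return resp.status == 200
       except (urllib.error.URLError, OSError):
           # Connection refused, timeout, or a non-2xx status: not ready yet.
           return False


   def wait_until_ready(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
       """Poll `url` until `probe` succeeds or `attempts` run out."""
       for attempt in range(attempts):
           if probe(url):
               return True
           if attempt < attempts - 1:
               sleep(delay)  # wait before the next probe
       return False

For example, ``wait_until_ready("http://localhost:10000/health")`` returns ``True`` as soon as the server responds, and ``False`` if it never does within the allotted attempts. The ``probe`` and ``sleep`` parameters exist only so the loop can be exercised without a live server.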
Combining Routing Methods
-------------------------