adilhafeez 2026-03-11 22:29:19 +00:00
parent e062847e6e
commit 48ace749a5
3 changed files with 221 additions and 2 deletions

@@ -1,6 +1,6 @@
Plano Docs v0.4.11
llms.txt (auto-generated)
Generated (UTC): 2026-03-11T19:50:12.195349+00:00
Generated (UTC): 2026-03-11T22:29:16.432883+00:00
Table of contents
- Agents (concepts/agents)
@@ -3756,6 +3756,109 @@ Flexible and Adaptive: Supports evolving user needs, model updates, and new doma
Production-Ready Performance: Optimized for low-latency, high-throughput applications in multi-model environments.
Self-hosting Arch-Router
By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either Ollama or vLLM.
Using Ollama (recommended for local development)
Install Ollama
Download and install from ollama.ai.
Pull and serve Arch-Router
ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
ollama serve
These commands download the quantized GGUF model from Hugging Face and start the Ollama server on http://localhost:11434.
Configure Plano to use local Arch-Router
routing:
  model: Arch-Router
  llm_provider: arch-router
model_providers:
  - name: arch-router
    model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
    base_url: http://localhost:11434
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative writing
        description: creative content generation, storytelling, and writing assistance
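Before starting Plano, it can help to sanity-check that the routing block points at a provider that actually exists. The sketch below is a hypothetical helper, not part of Plano: it assumes that routing.llm_provider is meant to match the name of one model_providers entry (as the example above suggests) and that a self-hosted router entry needs a base_url. The config is written as a Python dict to keep the example free of third-party YAML parsers.

```python
def check_router_provider(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed.

    Assumption (not confirmed by the Plano docs): routing.llm_provider must
    equal the `name` of one model_providers entry, and that entry needs a
    base_url when the router is self-hosted.
    """
    problems = []
    wanted = config.get("routing", {}).get("llm_provider")
    providers = config.get("model_providers", [])

    matches = [p for p in providers if p.get("name") == wanted]
    if not matches:
        problems.append(f"no model_providers entry named {wanted!r}")
    elif not matches[0].get("base_url"):
        problems.append(f"provider {wanted!r} has no base_url")
    return problems


# A dict mirroring the first two providers of the example configuration.
EXAMPLE_CONFIG = {
    "routing": {"model": "Arch-Router", "llm_provider": "arch-router"},
    "model_providers": [
        {
            "name": "arch-router",
            "model": "arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M",
            "base_url": "http://localhost:11434",
        },
        {"model": "openai/gpt-5.2", "access_key": "$OPENAI_API_KEY", "default": True},
    ],
}

if __name__ == "__main__":
    for problem in check_router_provider(EXAMPLE_CONFIG):
        print("config problem:", problem)
```

The helper only catches the two mismatches described above; it does not validate the full Plano schema.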
Verify the model is running
curl http://localhost:11434/v1/models
The response should list a model whose ID contains Arch-Router-1.5B.
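The same check can be scripted. This is a minimal sketch that assumes the endpoint returns an OpenAI-style model list (a JSON object with a "data" array of entries that each carry an "id"); the function names are illustrative and not part of Plano or Ollama.

```python
import json
import urllib.request


def listed_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry.get("id", "") for entry in payload.get("data", [])]


def has_model(payload: dict, needle: str) -> bool:
    """True if any listed model id contains `needle` (case-insensitive)."""
    return any(needle.lower() in model_id.lower() for model_id in listed_model_ids(payload))


def fetch_models(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/v1/models and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = fetch_models("http://localhost:11434")
    print("Arch-Router available:", has_model(payload, "Arch-Router-1.5B"))
```

Matching on a substring rather than the exact id keeps the check tolerant of the registry prefix and quantization tag in the model name.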
Using vLLM (recommended for production / EC2)
vLLM provides higher throughput and GPU optimizations suitable for production deployments.
Install vLLM
pip install vllm
Download the model weights
The GGUF weights are downloaded automatically from Hugging Face on first use. To pre-download them:
pip install huggingface_hub
huggingface-cli download katanemo/Arch-Router-1.5B.gguf
Start the vLLM server
After downloading, find the GGUF file and Jinja template in the Hugging Face cache:
# Find the downloaded files (uses the first snapshot directory)
SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -n 1)
# Paths are quoted so that a home directory containing spaces does not break the command
vllm serve "${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf" \
  --host 0.0.0.0 \
  --port 10000 \
  --load-format gguf \
  --chat-template "${SNAPSHOT_DIR}template.jinja" \
  --tokenizer katanemo/Arch-Router-1.5B \
  --served-model-name Arch-Router \
  --gpu-memory-utilization 0.3 \
  --tensor-parallel-size 1 \
  --enable-prefix-caching
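If the shell lookup above picks the wrong snapshot, or you want a clearer error when the files are missing, the lookup can be done in Python instead. The sketch below is an illustrative helper under stated assumptions: the file names are taken from the command above, the cache layout is the default one shown there, and unlike the shell snippet it returns the first snapshot that contains both files rather than the first snapshot unconditionally.

```python
from pathlib import Path

# Repository directory name as it appears in the shell snippet above.
REPO_DIR = "models--katanemo--Arch-Router-1.5B.gguf"


def find_snapshot_files(
    hub_cache: Path,
    gguf_name: str = "Arch-Router-1.5B-Q4_K_M.gguf",
    template_name: str = "template.jinja",
) -> tuple[Path, Path]:
    """Return (gguf_path, template_path) from the first snapshot holding both."""
    snapshots_root = Path(hub_cache) / REPO_DIR / "snapshots"
    if not snapshots_root.is_dir():
        raise FileNotFoundError(f"no snapshots directory at {snapshots_root}")

    # Sort for a deterministic choice when several snapshots exist.
    for snapshot in sorted(p for p in snapshots_root.iterdir() if p.is_dir()):
        gguf = snapshot / gguf_name
        template = snapshot / template_name
        if gguf.exists() and template.exists():
            return gguf, template
    raise FileNotFoundError(f"no snapshot under {snapshots_root} has both files")


if __name__ == "__main__":
    default_cache = Path.home() / ".cache" / "huggingface" / "hub"
    gguf_path, template_path = find_snapshot_files(default_cache)
    print(gguf_path)
    print(template_path)
```

The two printed paths can be passed to vllm serve and --chat-template respectively.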
Configure Plano to use the vLLM endpoint
routing:
  model: Arch-Router
  llm_provider: arch-router
model_providers:
  - name: arch-router
    model: Arch-Router
    base_url: http://<your-server-ip>:10000
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative writing
        description: creative content generation, storytelling, and writing assistance
Verify the server is running (run these on the server host, or replace localhost with your server's address)
curl http://localhost:10000/health
curl http://localhost:10000/v1/models
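Loading the model can take a while, so a single curl right after startup may fail even though the server is healthy a little later. The sketch below polls the /health endpoint until it answers with HTTP 200. The helper names, attempt count, and delay are assumptions chosen for illustration, not values from the Plano or vLLM documentation.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if a GET to `url` returns HTTP 200; False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call `probe()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(lambda: http_ok("http://localhost:10000/health"))
    print("vLLM ready" if ready else "vLLM did not become healthy in time")
```

Passing the probe as a callable keeps the retry loop independent of the network, so the same loop can wrap a /v1/models check as well.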
Combining Routing Methods
You can combine static model selection with dynamic routing preferences for maximum flexibility: