Mirror of https://github.com/katanemo/plano.git (synced 2026-06-02 14:35:14 +02:00)
deploy: 5400b0a2fa
Commit 48ace749a5 (parent e062847e6e)
3 changed files with 221 additions and 2 deletions

@@ -1,6 +1,6 @@
Plano Docs v0.4.11
llms.txt (auto-generated)
Generated (UTC): 2026-03-11T19:50:12.195349+00:00
Generated (UTC): 2026-03-11T22:29:16.432883+00:00

Table of contents
- Agents (concepts/agents)

@@ -3756,6 +3756,109 @@ Flexible and Adaptive: Supports evolving user needs, model updates, and new doma

Production-Ready Performance: Optimized for low-latency, high-throughput applications in multi-model environments.

Self-hosting Arch-Router

By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either Ollama or vLLM.

Using Ollama (recommended for local development)

Install Ollama

Download and install from ollama.ai.

Pull and serve Arch-Router

ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
ollama serve

This downloads the quantized GGUF model from HuggingFace and starts serving on http://localhost:11434.

Configure Plano to use local Arch-Router

routing:
  model: Arch-Router
  llm_provider: arch-router

model_providers:
  - name: arch-router
    model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
    base_url: http://localhost:11434

  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true

  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative writing
        description: creative content generation, storytelling, and writing assistance
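
The config above ties `routing.llm_provider` to a `model_providers` entry by name. A rough pre-flight check of that linkage might look like the Python sketch below. It is not Plano's own validation logic: the function name `check_router_provider` and the two rules it enforces (a provider with a matching `name`, and a non-empty `base_url` on it) are assumptions made for illustration, and the config is passed in as an already-parsed dict rather than read from a file.

```python
def check_router_provider(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the check passed.

    Hypothetical helper: it only verifies that routing.llm_provider names an
    entry in model_providers and that this entry carries a base_url.
    """
    wanted = config.get("routing", {}).get("llm_provider")
    if not wanted:
        return ["routing.llm_provider is not set"]

    # Providers without a "name" key (e.g. the hosted LLMs) are simply skipped.
    matches = [p for p in config.get("model_providers", []) if p.get("name") == wanted]

    problems = []
    if not matches:
        problems.append(f"no model_providers entry named {wanted!r}")
    elif not matches[0].get("base_url"):
        problems.append(f"provider {wanted!r} has no base_url")
    return problems


# The Ollama config from this section, written out as a parsed dict.
ollama_config = {
    "routing": {"model": "Arch-Router", "llm_provider": "arch-router"},
    "model_providers": [
        {
            "name": "arch-router",
            "model": "arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M",
            "base_url": "http://localhost:11434",
        },
        {"model": "openai/gpt-5.2", "access_key": "$OPENAI_API_KEY", "default": True},
    ],
}
```

Calling `check_router_provider(ollama_config)` returns an empty list for the config shown; a typo in either `llm_provider` or the provider `name` would surface as a single message.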

Verify the model is running

curl http://localhost:11434/v1/models

You should see Arch-Router-1.5B listed in the response.
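
The same check can be scripted instead of read by eye. The Python sketch below assumes the endpoint answers with an OpenAI-style model list, that is, a JSON object whose `data` array holds entries with an `id` field; the helper names `list_model_ids`, `model_is_listed`, and `fetch_models` are illustrative and not part of Plano or Ollama.

```python
import json
import urllib.request


def list_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def model_is_listed(payload: dict, needle: str) -> bool:
    """Return True if any listed model id contains `needle` (case-insensitive)."""
    return any(needle.lower() in model_id.lower() for model_id in list_model_ids(payload))


def fetch_models(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/v1/models and decode the JSON body.

    Assumes the server is already running and reachable at base_url.
    """
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the Ollama server from this section running, `model_is_listed(fetch_models("http://localhost:11434"), "Arch-Router-1.5B")` should evaluate to `True` once the model has been pulled.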

Using vLLM (recommended for production / EC2)

vLLM provides higher throughput and GPU optimizations suitable for production deployments.

Install vLLM

pip install vllm

Download the model weights

The GGUF weights are downloaded automatically from HuggingFace on first use. To pre-download:

pip install huggingface_hub
huggingface-cli download katanemo/Arch-Router-1.5B.gguf

Start the vLLM server

After downloading, find the GGUF file and Jinja template in the HuggingFace cache:

# Find the downloaded files
SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -1)

# SNAPSHOT_DIR ends with a trailing slash, so file names are appended directly
vllm serve "${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf" \
  --host 0.0.0.0 \
  --port 10000 \
  --load-format gguf \
  --chat-template "${SNAPSHOT_DIR}template.jinja" \
  --tokenizer katanemo/Arch-Router-1.5B \
  --served-model-name Arch-Router \
  --gpu-memory-utilization 0.3 \
  --tensor-parallel-size 1 \
  --enable-prefix-caching
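
The `ls -d ... | head -1` lookup above picks the first snapshot directory in the cache. For scripts that prefer an explicit error over an empty variable, the same lookup can be sketched in Python. The helper name `find_snapshot_file` is hypothetical, and the directory layout it walks (`models--<org>--<repo>/snapshots/<revision>/`) is taken from the path used in the shell command, not from any Plano API.

```python
from pathlib import Path


def find_snapshot_file(hub_dir: Path, repo_id: str, filename: str) -> Path:
    """Locate `filename` inside the first snapshot of `repo_id` under `hub_dir`.

    Mirrors the shell lookup: snapshots are sorted by name and the first one wins.
    Raises FileNotFoundError if the repo, a snapshot, or the file is missing.
    """
    snapshots_dir = hub_dir / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    snapshots = sorted(p for p in snapshots_dir.iterdir() if p.is_dir())
    if not snapshots:
        raise FileNotFoundError(f"no snapshots under {snapshots_dir}")

    candidate = snapshots[0] / filename
    if not candidate.exists():
        raise FileNotFoundError(f"{filename} not found in {snapshots[0]}")
    return candidate


# Default cache location, matching the path in the shell command above.
DEFAULT_HUB_DIR = Path.home() / ".cache" / "huggingface" / "hub"
```

For example, `find_snapshot_file(DEFAULT_HUB_DIR, "katanemo/Arch-Router-1.5B.gguf", "Arch-Router-1.5B-Q4_K_M.gguf")` yields the path to pass to `vllm serve`, and the same call with `"template.jinja"` yields the chat template path.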

Configure Plano to use the vLLM endpoint

routing:
  model: Arch-Router
  llm_provider: arch-router

model_providers:
  - name: arch-router
    model: Arch-Router
    base_url: http://<your-server-ip>:10000

  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true

  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative writing
        description: creative content generation, storytelling, and writing assistance

Verify the server is running

curl http://localhost:10000/health
curl http://localhost:10000/v1/models
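
Loading the model can take a while, so a single `curl` right after startup may fail even though the server is coming up. A minimal polling sketch in Python follows. It assumes the `/health` endpoint answers with HTTP 200 once the server is ready; the names `http_ok` and `wait_until_healthy` and the default retry settings are illustrative choices, not part of vLLM or Plano.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe(url)` up to `attempts` times, pausing `delay` seconds between tries.

    `probe` and `sleep` are injectable so the loop can be exercised without a server.
    """
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A deployment script could call `wait_until_healthy("http://localhost:10000/health")` and only start Plano once it returns `True`.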

Combining Routing Methods

You can combine static model selection with dynamic routing preferences for maximum flexibility: