deploy: 5400b0a2fa

2026-07-20 16:41:04 +02:00 · 2026-03-11 22:29:19 +00:00 · 2026-03-11 22:29:19 +00:00 · 48ace749a5
commit 48ace749a5
parent e062847e6e
3 changed files with 221 additions and 2 deletions
--- a/includes/llms.txt
+++ b/includes/llms.txt
@@ -1,6 +1,6 @@
 Plano Docs v0.4.11
 llms.txt (auto-generated)
-Generated (UTC): 2026-03-11T19:50:12.195349+00:00
+Generated (UTC): 2026-03-11T22:29:16.432883+00:00

 Table of contents
 - Agents (concepts/agents)
@@ -3756,6 +3756,109 @@ Flexible and Adaptive: Supports evolving user needs, model updates, and new doma

 Production-Ready Performance: Optimized for low-latency, high-throughput applications in multi-model environments.

+Self-hosting Arch-Router
+
+By default, Plano uses a hosted Arch-Router endpoint. To self-host Arch-Router instead, serve the model yourself with either Ollama or vLLM.
+
+Using Ollama (recommended for local development)
+
+Install Ollama
+
+Download and install from ollama.ai.
+
+Pull and serve Arch-Router
+
+ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+ollama serve
+
+This downloads the quantized GGUF model from Hugging Face and starts the Ollama server on http://localhost:11434.
+
+Configure Plano to use local Arch-Router
+
+routing:
+  model: Arch-Router
+  llm_provider: arch-router
+
+model_providers:
+  - name: arch-router
+    model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
+    base_url: http://localhost:11434
+
+  - model: openai/gpt-5.2
+    access_key: $OPENAI_API_KEY
+    default: true
+
+  - model: anthropic/claude-sonnet-4-5
+    access_key: $ANTHROPIC_API_KEY
+    routing_preferences:
+      - name: creative writing
+        description: creative content generation, storytelling, and writing assistance
+
+Verify the model is running
+
+curl http://localhost:11434/v1/models
+
+You should see Arch-Router-1.5B listed in the response.
+
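The Ollama check above can also be scripted. This is a minimal sketch, assuming the endpoint returns an OpenAI-style model list (a JSON object with a "data" array of entries carrying an "id"). The helper names model_ids, has_model, and fetch_models are hypothetical and not part of Plano or Ollama.

```python
import json
import urllib.request


def model_ids(payload):
    """Return the model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def has_model(payload, needle):
    """True if any listed model id contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return any(needle in model_id.lower() for model_id in model_ids(payload))


def fetch_models(base_url, timeout=5.0):
    """GET <base_url>/v1/models and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)
```

With Ollama serving on its default port, has_model(fetch_models("http://localhost:11434"), "Arch-Router-1.5B") should evaluate to True once the model has been pulled.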
+Using vLLM (recommended for production / EC2)
+
+vLLM provides higher throughput and GPU optimizations suitable for production deployments.
+
+Install vLLM
+
+pip install vllm
+
+Download the model weights
+
+The GGUF weights are downloaded automatically from Hugging Face on first use. To pre-download them:
+
+pip install huggingface_hub
+huggingface-cli download katanemo/Arch-Router-1.5B.gguf
+
+Start the vLLM server
+
+After downloading, locate the GGUF file and Jinja template in the Hugging Face cache:
+
+# Find the downloaded files
+SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -1)
+
+vllm serve "${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf" \
+    --host 0.0.0.0 \
+    --port 10000 \
+    --load-format gguf \
+    --chat-template "${SNAPSHOT_DIR}template.jinja" \
+    --tokenizer katanemo/Arch-Router-1.5B \
+    --served-model-name Arch-Router \
+    --gpu-memory-utilization 0.3 \
+    --tensor-parallel-size 1 \
+    --enable-prefix-caching
+
+Configure Plano to use the vLLM endpoint
+
+routing:
+  model: Arch-Router
+  llm_provider: arch-router
+
+model_providers:
+  - name: arch-router
+    model: Arch-Router
+    base_url: http://<your-server-ip>:10000
+
+  - model: openai/gpt-5.2
+    access_key: $OPENAI_API_KEY
+    default: true
+
+  - model: anthropic/claude-sonnet-4-5
+    access_key: $ANTHROPIC_API_KEY
+    routing_preferences:
+      - name: creative writing
+        description: creative content generation, storytelling, and writing assistance
+
+Verify the server is running
+
+curl http://localhost:10000/health
+curl http://localhost:10000/v1/models
+
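A freshly started vLLM server may need some time to load the weights before it answers requests, so a deployment script might poll the /health endpoint shown above rather than call it once. This is a minimal sketch under that assumption. The names http_ok and wait_until_healthy, and the retry defaults, are hypothetical and not taken from Plano or vLLM.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_healthy(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next try
    return False
```

For the server started above, wait_until_healthy("http://localhost:10000/health") returns True as soon as the endpoint responds with 200 and False if it never does within the allowed attempts. The probe and sleep parameters are injectable so the loop can be exercised without a running server.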
 Combining Routing Methods

 You can combine static model selection with dynamic routing preferences for maximum flexibility: