adilhafeez 2026-03-11 22:29:19 +00:00
parent e062847e6e
commit 48ace749a5
3 changed files with 221 additions and 2 deletions


@@ -363,6 +363,117 @@ Provides a practical mechanism to encode user preferences through domain-action
<li><p><strong>Production-Ready Performance</strong>: Optimized for low-latency, high-throughput applications in multi-model environments.</p></li>
</ul>
</section>
<section id="self-hosting-arch-router">
<h2>Self-hosting Arch-Router<a @click.prevent="window.navigator.clipboard.writeText($el.href); $el.setAttribute('data-tooltip', 'Copied!'); setTimeout(() =&gt; $el.setAttribute('data-tooltip', 'Copy link to this element'), 2000)" aria-label="Copy link to this element" class="headerlink" data-tooltip="Copy link to this element" href="#self-hosting-arch-router" x-intersect.margin.0%.0%.-70%.0%="activeSection = '#self-hosting-arch-router'"><svg height="1em" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76 0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71 0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71 0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76 0 5-2.24 5-5s-2.24-5-5-5z"></path></svg></a></h2>
<p>By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either <strong>Ollama</strong> or <strong>vLLM</strong>.</p>
<section id="using-ollama-recommended-for-local-development">
<h3>Using Ollama (recommended for local development)<a @click.prevent="window.navigator.clipboard.writeText($el.href); $el.setAttribute('data-tooltip', 'Copied!'); setTimeout(() =&gt; $el.setAttribute('data-tooltip', 'Copy link to this element'), 2000)" aria-label="Copy link to this element" class="headerlink" data-tooltip="Copy link to this element" href="#using-ollama-recommended-for-local-development" x-intersect.margin.0%.0%.-70%.0%="activeSection = '#using-ollama-recommended-for-local-development'"><svg height="1em" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76 0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71 0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71 0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76 0 5-2.24 5-5s-2.24-5-5-5z"></path></svg></a></h3>
<ol class="arabic">
<li><p><strong>Install Ollama</strong></p>
<p>Download and install from <a class="reference external" href="https://ollama.ai" rel="nofollow noopener">ollama.ai<svg fill="currentColor" height="1em" stroke="none" viewbox="0 96 960 960" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M188 868q-11-11-11-28t11-28l436-436H400q-17 0-28.5-11.5T360 336q0-17 11.5-28.5T400 296h320q17 0 28.5 11.5T760 336v320q0 17-11.5 28.5T720 696q-17 0-28.5-11.5T680 656V432L244 868q-11 11-28 11t-28-11Z"></path></svg></a>.</p>
</li>
<li><p><strong>Pull and serve Arch-Router</strong></p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><code><span id="line-1">ollama<span class="w"> </span>pull<span class="w"> </span>hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
</span><span id="line-2">ollama<span class="w"> </span>serve
</span></code></pre></div>
</div>
<p>This downloads the quantized GGUF model from Hugging Face and starts the Ollama server on <code class="docutils literal notranslate"><span class="pre">http://localhost:11434</span></code>.</p>
</li>
<li><p><strong>Configure Plano to use local Arch-Router</strong></p>
<div class="highlight-yaml notranslate"><div class="highlight"><pre><span></span><code><span id="line-1"><span class="nt">routing</span><span class="p">:</span>
</span><span id="line-2"><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Arch-Router</span>
</span><span id="line-3"><span class="w"> </span><span class="nt">llm_provider</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">arch-router</span>
</span><span id="line-4">
</span><span id="line-5"><span class="nt">model_providers</span><span class="p">:</span>
</span><span id="line-6"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">arch-router</span>
</span><span id="line-7"><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M</span>
</span><span id="line-8"><span class="w"> </span><span class="nt">base_url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">http://localhost:11434</span>
</span><span id="line-9">
</span><span id="line-10"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">openai/gpt-5.2</span>
</span><span id="line-11"><span class="w"> </span><span class="nt">access_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$OPENAI_API_KEY</span>
</span><span id="line-12"><span class="w"> </span><span class="nt">default</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
</span><span id="line-13">
</span><span id="line-14"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">anthropic/claude-sonnet-4-5</span>
</span><span id="line-15"><span class="w"> </span><span class="nt">access_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$ANTHROPIC_API_KEY</span>
</span><span id="line-16"><span class="w"> </span><span class="nt">routing_preferences</span><span class="p">:</span>
</span><span id="line-17"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">creative writing</span>
</span><span id="line-18"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">creative content generation, storytelling, and writing assistance</span>
</span></code></pre></div>
</div>
</li>
<li><p><strong>Verify the model is running</strong></p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><code><span id="line-1">curl<span class="w"> </span>http://localhost:11434/v1/models
</span></code></pre></div>
</div>
<p>You should see <code class="docutils literal notranslate"><span class="pre">Arch-Router-1.5B</span></code> listed in the response.</p>
</li>
</ol>
</section>
<section id="using-vllm-recommended-for-production-ec2">
<h3>Using vLLM (recommended for production / EC2)<a @click.prevent="window.navigator.clipboard.writeText($el.href); $el.setAttribute('data-tooltip', 'Copied!'); setTimeout(() =&gt; $el.setAttribute('data-tooltip', 'Copy link to this element'), 2000)" aria-label="Copy link to this element" class="headerlink" data-tooltip="Copy link to this element" href="#using-vllm-recommended-for-production-ec2" x-intersect.margin.0%.0%.-70%.0%="activeSection = '#using-vllm-recommended-for-production-ec2'"><svg height="1em" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76 0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71 0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71 0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76 0 5-2.24 5-5s-2.24-5-5-5z"></path></svg></a></h3>
<p>vLLM provides higher throughput and GPU optimizations suitable for production deployments.</p>
<ol class="arabic">
<li><p><strong>Install vLLM</strong></p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><code><span id="line-1">pip<span class="w"> </span>install<span class="w"> </span>vllm
</span></code></pre></div>
</div>
</li>
<li><p><strong>Download the model weights</strong></p>
<p>The GGUF weights are downloaded automatically from Hugging Face on first use. To pre-download them:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><code><span id="line-1">pip<span class="w"> </span>install<span class="w"> </span>huggingface_hub
</span><span id="line-2">huggingface-cli<span class="w"> </span>download<span class="w"> </span>katanemo/Arch-Router-1.5B.gguf
</span></code></pre></div>
</div>
</li>
<li><p><strong>Start the vLLM server</strong></p>
<p>After the download completes, locate the GGUF file and the Jinja chat template in the Hugging Face cache, then start the server:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><code><span id="line-1"><span class="c1"># Find the downloaded files</span>
</span><span id="line-2"><span class="nv">SNAPSHOT_DIR</span><span class="o">=</span><span class="k">$(</span>ls<span class="w"> </span>-d<span class="w"> </span>~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/<span class="w"> </span><span class="p">|</span><span class="w"> </span>head<span class="w"> </span>-1<span class="k">)</span>
</span><span id="line-3">
</span><span id="line-4">vllm<span class="w"> </span>serve<span class="w"> </span><span class="si">${</span><span class="nv">SNAPSHOT_DIR</span><span class="si">}</span>Arch-Router-1.5B-Q4_K_M.gguf<span class="w"> </span><span class="se">\</span>
</span><span id="line-5"><span class="w"> </span>--host<span class="w"> </span><span class="m">0</span>.0.0.0<span class="w"> </span><span class="se">\</span>
</span><span id="line-6"><span class="w"> </span>--port<span class="w"> </span><span class="m">10000</span><span class="w"> </span><span class="se">\</span>
</span><span id="line-7"><span class="w"> </span>--load-format<span class="w"> </span>gguf<span class="w"> </span><span class="se">\</span>
</span><span id="line-8"><span class="w"> </span>--chat-template<span class="w"> </span><span class="si">${</span><span class="nv">SNAPSHOT_DIR</span><span class="si">}</span>template.jinja<span class="w"> </span><span class="se">\</span>
</span><span id="line-9"><span class="w"> </span>--tokenizer<span class="w"> </span>katanemo/Arch-Router-1.5B<span class="w"> </span><span class="se">\</span>
</span><span id="line-10"><span class="w"> </span>--served-model-name<span class="w"> </span>Arch-Router<span class="w"> </span><span class="se">\</span>
</span><span id="line-11"><span class="w"> </span>--gpu-memory-utilization<span class="w"> </span><span class="m">0</span>.3<span class="w"> </span><span class="se">\</span>
</span><span id="line-12"><span class="w"> </span>--tensor-parallel-size<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
</span><span id="line-13"><span class="w"> </span>--enable-prefix-caching
</span></code></pre></div>
</div>
</li>
<li><p><strong>Configure Plano to use the vLLM endpoint</strong></p>
<div class="highlight-yaml notranslate"><div class="highlight"><pre><span></span><code><span id="line-1"><span class="nt">routing</span><span class="p">:</span>
</span><span id="line-2"><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Arch-Router</span>
</span><span id="line-3"><span class="w"> </span><span class="nt">llm_provider</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">arch-router</span>
</span><span id="line-4">
</span><span id="line-5"><span class="nt">model_providers</span><span class="p">:</span>
</span><span id="line-6"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">arch-router</span>
</span><span id="line-7"><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Arch-Router</span>
</span><span id="line-8"><span class="w"> </span><span class="nt">base_url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">http://&lt;your-server-ip&gt;:10000</span>
</span><span id="line-9">
</span><span id="line-10"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">openai/gpt-5.2</span>
</span><span id="line-11"><span class="w"> </span><span class="nt">access_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$OPENAI_API_KEY</span>
</span><span id="line-12"><span class="w"> </span><span class="nt">default</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
</span><span id="line-13">
</span><span id="line-14"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">anthropic/claude-sonnet-4-5</span>
</span><span id="line-15"><span class="w"> </span><span class="nt">access_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$ANTHROPIC_API_KEY</span>
</span><span id="line-16"><span class="w"> </span><span class="nt">routing_preferences</span><span class="p">:</span>
</span><span id="line-17"><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">creative writing</span>
</span><span id="line-18"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">creative content generation, storytelling, and writing assistance</span>
</span></code></pre></div>
</div>
</li>
<li><p><strong>Verify the server is running</strong></p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><code><span id="line-1">curl<span class="w"> </span>http://localhost:10000/health
</span><span id="line-2">curl<span class="w"> </span>http://localhost:10000/v1/models
</span></code></pre></div>
</div>
</li>
</ol>
</section>
</section>
<section id="combining-routing-methods">
<h2>Combining Routing Methods<a @click.prevent="window.navigator.clipboard.writeText($el.href); $el.setAttribute('data-tooltip', 'Copied!'); setTimeout(() =&gt; $el.setAttribute('data-tooltip', 'Copy link to this element'), 2000)" aria-label="Copy link to this element" class="headerlink" data-tooltip="Copy link to this element" href="#combining-routing-methods" x-intersect.margin.0%.0%.-70%.0%="activeSection = '#combining-routing-methods'"><svg height="1em" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76 0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71 0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71 0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76 0 5-2.24 5-5s-2.24-5-5-5z"></path></svg></a></h2>
<p>You can combine static model selection with dynamic routing preferences for maximum flexibility:</p>
@@ -495,6 +606,11 @@ Provides a practical mechanism to encode user preferences through domain-action
</ul>
</li>
<li><a :data-current="activeSection === '#id7'" class="reference internal" href="#id7">Arch-Router</a></li>
<li><a :data-current="activeSection === '#self-hosting-arch-router'" class="reference internal" href="#self-hosting-arch-router">Self-hosting Arch-Router</a><ul>
<li><a :data-current="activeSection === '#using-ollama-recommended-for-local-development'" class="reference internal" href="#using-ollama-recommended-for-local-development">Using Ollama (recommended for local development)</a></li>
<li><a :data-current="activeSection === '#using-vllm-recommended-for-production-ec2'" class="reference internal" href="#using-vllm-recommended-for-production-ec2">Using vLLM (recommended for production / EC2)</a></li>
</ul>
</li>
<li><a :data-current="activeSection === '#combining-routing-methods'" class="reference internal" href="#combining-routing-methods">Combining Routing Methods</a></li>
<li><a :data-current="activeSection === '#example-use-cases'" class="reference internal" href="#example-use-cases">Example Use Cases</a></li>
<li><a :data-current="activeSection === '#best-practices'" class="reference internal" href="#best-practices">Best practices</a></li>


@@ -1,6 +1,6 @@
Plano Docs v0.4.11
llms.txt (auto-generated)
-Generated (UTC): 2026-03-11T19:50:12.195349+00:00
+Generated (UTC): 2026-03-11T22:29:16.432883+00:00
Table of contents
- Agents (concepts/agents)
@@ -3756,6 +3756,109 @@ Flexible and Adaptive: Supports evolving user needs, model updates, and new doma
Production-Ready Performance: Optimized for low-latency, high-throughput applications in multi-model environments.
Self-hosting Arch-Router
By default, Plano uses a hosted Arch-Router endpoint. To run Arch-Router locally, you can serve the model yourself using either Ollama or vLLM.
Using Ollama (recommended for local development)
Install Ollama
Download and install from ollama.ai.
Pull and serve Arch-Router
ollama pull hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
ollama serve
This downloads the quantized GGUF model from Hugging Face and starts the Ollama server on http://localhost:11434.
Configure Plano to use local Arch-Router
routing:
model: Arch-Router
llm_provider: arch-router
model_providers:
- name: arch-router
model: arch/hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
base_url: http://localhost:11434
- model: openai/gpt-5.2
access_key: $OPENAI_API_KEY
default: true
- model: anthropic/claude-sonnet-4-5
access_key: $ANTHROPIC_API_KEY
routing_preferences:
- name: creative writing
description: creative content generation, storytelling, and writing assistance
Verify the model is running
curl http://localhost:11434/v1/models
You should see Arch-Router-1.5B listed in the response.
Using vLLM (recommended for production / EC2)
vLLM provides higher throughput and GPU optimizations suitable for production deployments.
Install vLLM
pip install vllm
Download the model weights
The GGUF weights are downloaded automatically from Hugging Face on first use. To pre-download them:
pip install huggingface_hub
huggingface-cli download katanemo/Arch-Router-1.5B.gguf
Start the vLLM server
After the download completes, locate the GGUF file and the Jinja chat template in the Hugging Face cache, then start the server:
# Find the downloaded files
SNAPSHOT_DIR=$(ls -d ~/.cache/huggingface/hub/models--katanemo--Arch-Router-1.5B.gguf/snapshots/*/ | head -1)
vllm serve ${SNAPSHOT_DIR}Arch-Router-1.5B-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 10000 \
--load-format gguf \
--chat-template ${SNAPSHOT_DIR}template.jinja \
--tokenizer katanemo/Arch-Router-1.5B \
--served-model-name Arch-Router \
--gpu-memory-utilization 0.3 \
--tensor-parallel-size 1 \
--enable-prefix-caching
Configure Plano to use the vLLM endpoint
routing:
model: Arch-Router
llm_provider: arch-router
model_providers:
- name: arch-router
model: Arch-Router
base_url: http://<your-server-ip>:10000
- model: openai/gpt-5.2
access_key: $OPENAI_API_KEY
default: true
- model: anthropic/claude-sonnet-4-5
access_key: $ANTHROPIC_API_KEY
routing_preferences:
- name: creative writing
description: creative content generation, storytelling, and writing assistance
Verify the server is running
curl http://localhost:10000/health
curl http://localhost:10000/v1/models
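The verification steps above can also be scripted. The following is a minimal sketch, assuming the /v1/models endpoint returns an OpenAI-style model list in which each entry carries an "id" field. The helper names list_model_ids and has_model are hypothetical and are not part of Plano, Ollama, or vLLM.

```shell
#!/usr/bin/env bash
# Sketch only: text-based check of a /v1/models response read from stdin.
# list_model_ids and has_model are hypothetical helper names.

# Print every "id" string value found in the JSON on stdin, one per line.
# This is a plain text match, so ids of nested objects are printed as well.
list_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/^"id"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Exit with status 0 if any printed id contains the substring given as $1.
has_model() {
  list_model_ids | grep -qF -- "$1"
}
```

For example, `curl -s http://localhost:10000/v1/models | has_model Arch-Router` should exit with status 0 once the vLLM server from the steps above is serving the model under that name. The match is deliberately simple; a JSON-aware tool would be more robust where one is available.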
Combining Routing Methods
You can combine static model selection with dynamic routing preferences for maximum flexibility:

File diff suppressed because one or more lines are too long