Configuration Guide
Configuration File
The NOMYO Router is configured via a YAML file (default: config.yaml). This file defines the Ollama endpoints, connection limits, and API keys.
Basic Configuration
# config.yaml
endpoints:
- http://localhost:11434
- http://ollama-server:11434
# Maximum concurrent connections *per endpoint‑model pair*
max_concurrent_connections: 2
# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""
Complete Example
# config.yaml
endpoints:
- http://192.168.0.50:11434
- http://192.168.0.51:11434
- http://192.168.0.52:11434
- https://api.openai.com/v1
# Maximum concurrent connections *per endpoint-model pair* (should match OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2
# Per-endpoint overrides — any field not listed falls back to the global default (optional)
# endpoint_config:
#   "http://192.168.0.50:11434":
#     max_concurrent_connections: 4
#   "http://192.168.0.51:11434":
#     max_concurrent_connections: 1
# Priority / WRR routing (optional, default: false)
# priority_routing: true
# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""
# API keys for remote endpoints
# Reference secrets through environment variables, e.g. ${OPENAI_KEY}
# Each key must match the URL in the endpoints block exactly
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
Configuration Options
endpoints
Type: list[str]
Description: List of backend endpoint URLs. Can include both Ollama endpoints (http://host:11434) and OpenAI-compatible endpoints (https://api.openai.com/v1).
Examples:
endpoints:
- http://localhost:11434
- http://ollama1:11434
- http://ollama2:11434
- https://api.openai.com/v1
- https://api.anthropic.com/v1
Notes:
- Ollama endpoints use the standard /api/ prefix
- OpenAI-compatible endpoints use the /v1 prefix
- The router automatically detects the endpoint type based on the URL pattern
max_concurrent_connections
Type: int
Default: 1
Description: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's OLLAMA_NUM_PARALLEL setting.
Example:
max_concurrent_connections: 4
Notes:
- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage
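One way to picture the per-pair limit is a counter keyed by (endpoint, model). The sketch below is illustrative only: SlotTracker and its methods are hypothetical names chosen for this example, not the router's internals.

```python
from collections import defaultdict


class SlotTracker:
    """Counts active requests per (endpoint, model) pair. Hypothetical helper."""

    def __init__(self, max_concurrent_connections: int = 1) -> None:
        self.limit = max_concurrent_connections
        self.active = defaultdict(int)  # (endpoint, model) -> active requests

    def has_capacity(self, endpoint: str, model: str) -> bool:
        return self.active[(endpoint, model)] < self.limit

    def acquire(self, endpoint: str, model: str) -> bool:
        """Reserve a slot; return False when the pair is already at its limit."""
        if not self.has_capacity(endpoint, model):
            return False
        self.active[(endpoint, model)] += 1
        return True

    def release(self, endpoint: str, model: str) -> None:
        key = (endpoint, model)
        if self.active[key] > 0:
            self.active[key] -= 1
```

Because the key includes the model name, a saturated model on one endpoint does not block a different model on the same endpoint in this sketch.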
endpoint_config
Type: dict[str, dict] (optional)
Default: {} (all endpoints use the global max_concurrent_connections)
Description: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in endpoints exactly. Any field not listed falls back to the global default.
Supported per-endpoint fields:
| Field | Description |
|---|---|
| max_concurrent_connections | Overrides the global limit for this endpoint only |
Example:
endpoint_config:
  "http://192.168.0.50:11434":
    max_concurrent_connections: 4 # high-memory GPU node
  "http://192.168.0.51:11434":
    max_concurrent_connections: 1 # low-memory node
Notes:
- Useful when endpoints have different hardware capacity.
- The utilization ratio used by WRR (priority_routing: true) is computed per endpoint using the effective limit, so a node with max_concurrent_connections: 4 running 2 requests is considered 50% utilized, the same as a node with a limit of 2 running 1 request.
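The fallback rule and the utilization ratio can be sketched in a few lines of Python. The function names effective_limit and utilization are assumptions made for this example; only the config keys come from this guide.

```python
def effective_limit(endpoint: str, global_limit: int, endpoint_config: dict) -> int:
    """Per-endpoint override if present, otherwise the global default."""
    return endpoint_config.get(endpoint, {}).get(
        "max_concurrent_connections", global_limit
    )


def utilization(active_connections: int, limit: int) -> float:
    """Share of the effective limit that is currently in use."""
    return active_connections / limit


endpoint_config = {
    "http://192.168.0.50:11434": {"max_concurrent_connections": 4},
}

# The overridden node uses 4; a node without an override falls back to the global 2.
high = effective_limit("http://192.168.0.50:11434", 2, endpoint_config)
other = effective_limit("http://192.168.0.52:11434", 2, endpoint_config)
```

With these values, 2 active requests on the first node and 1 on the second both give a ratio of 0.5.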
priority_routing
Type: bool (optional)
Default: false
Description: Selects the load-balancing algorithm used when multiple endpoints are available for a request.
| Value | Algorithm |
|---|---|
| false (default) | Random selection among equally-idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
| true | Weighted Round Robin (WRR): endpoints are ranked by utilization ratio (active_connections / max_concurrent_connections). Config order acts as the tiebreaker: the endpoint listed first in endpoints is preferred when two candidates have equal utilization. |
Example:
priority_routing: true
When to use WRR:
- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
- Combined with endpoint_config to give the primary a higher max_concurrent_connections, so the utilization ratio reflects real capacity rather than raw slot counts.
Example — primary/fallback setup:
endpoints:
- http://gpu-primary:11434 # preferred
- http://gpu-secondary:11434 # fallback
endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2
priority_routing: true
With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.
conversation_affinity
Type: bool (optional)
Default: false
Companion setting: conversation_affinity_ttl
Description: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama KV cache) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a soft preference — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (priority_routing or random).
How a conversation is identified
The router does not track session IDs or auth tokens. It computes a stable fingerprint per request from:
SHA1( model
+ every leading message with role="system"
+ the first message with role="user" )
Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
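The fingerprint recipe above can be sketched as follows. This is a minimal illustration, not the router's code: the function name conversation_fingerprint is hypothetical, and the way the parts are serialized before hashing (a JSON list here) is an assumption, since the guide only names the inputs and the hash.

```python
import hashlib
import json


def conversation_fingerprint(model: str, messages: list[dict]) -> str:
    """SHA1 over the model, the leading system messages, and the first user message."""
    parts = [model]
    i = 0
    # Collect every system message that appears before any other role
    while i < len(messages) and messages[i].get("role") == "system":
        parts.append(messages[i].get("content", ""))
        i += 1
    # Add the first user message found after the leading system messages
    for msg in messages[i:]:
        if msg.get("role") == "user":
            parts.append(msg.get("content", ""))
            break
    # Assumed serialization: a JSON list keeps the part boundaries unambiguous
    payload = json.dumps(parts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()
```

Later turns only append messages after the first user turn, so they leave the hashed parts, and therefore the fingerprint, unchanged.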
What this means in practice
| You send… | Fingerprint behaves like… |
|---|---|
| Turn 2 of the same chat (history grows but first system+user are unchanged) | Same as turn 1 → pin is reused and TTL refreshed |
| Turn 1 of a fresh chat | New fingerprint → new pin |
| Same first user prompt but a different model | New fingerprint (model is part of the hash) |
| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | New fingerprint — the affinity will not stick |
TTL and refresh
Every time choose_endpoint returns a pinned endpoint, the entry's expiry is bumped to now + conversation_affinity_ttl. An idle conversation drops out of the map once that window elapses without traffic. The default of 300 s matches Ollama's default keep_alive: once the backend has unloaded the model, the KV cache is gone too, so a stale pin would serve no purpose.
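The sliding expiry can be modelled with a small map from fingerprint to (endpoint, expiry). The sketch below is an assumption about the shape of such a structure; AffinityMap, pin and lookup are hypothetical names, and the injectable clock exists only to make the behaviour easy to test.

```python
import time


class AffinityMap:
    """Maps conversation fingerprints to endpoints with a sliding expiry. Hypothetical."""

    def __init__(self, ttl: float = 300.0, clock=time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self.pins = {}  # fingerprint -> (endpoint, expires_at)

    def pin(self, fingerprint: str, endpoint: str) -> None:
        self.pins[fingerprint] = (endpoint, self.clock() + self.ttl)

    def lookup(self, fingerprint: str):
        """Return the pinned endpoint and refresh its expiry, or None if absent or expired."""
        entry = self.pins.get(fingerprint)
        if entry is None:
            return None
        endpoint, expires_at = entry
        if self.clock() >= expires_at:
            del self.pins[fingerprint]  # idle for longer than the TTL
            return None
        self.pin(fingerprint, endpoint)  # bump expiry to now + ttl
        return endpoint
```

Each successful lookup pushes the expiry forward, so an active conversation keeps its pin while an abandoned one disappears after ttl seconds.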
Why the dashboard may show more than one dot per visible conversation
The fingerprint is computed per HTTP request, not per chat-window. Most chat UIs (Open WebUI in particular) fire several auxiliary requests alongside the real conversation:
- Title generation — synthetic system prompt + the user message as content
- Follow-up question suggestion — synthetic system prompt + the conversation as content
- Tag generation, memory extraction, retrieval query rewriting, etc.
Each of those has its own (system + first user turn) and therefore its own fingerprint and its own pin in the affinity dot matrix. They all correctly refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
Example
endpoints:
- http://gpu-primary:11434
- http://gpu-secondary:11434
conversation_affinity: true
conversation_affinity_ttl: 300
With this configuration, a chat that starts on gpu-primary will keep returning to gpu-primary for follow-up turns as long as the model is still loaded there and a slot is free, even if gpu-secondary happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
When to enable
- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
- ✅ Multi-endpoint deployments where models are loaded on more than one node.
- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
conversation_affinity_ttl
Type: int (seconds, optional)
Default: 300
Description: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after conversation_affinity_ttl seconds of silence.
Recommendation: leave this aligned with the backend's keep_alive window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
Example:
conversation_affinity: true
conversation_affinity_ttl: 600 # ten minutes of inactivity before un-pinning
router_api_key
Type: str (optional)
Description: Shared secret that gates access to the NOMYO Router APIs and dashboard. In config.yaml the key is written as nomyo-router-api-key. When set, clients must send Authorization: Bearer <key> or an api_key query parameter.
Example:
nomyo-router-api-key: "super-secret-value"
Notes:
- Leave this blank or omit it to disable router-level authentication.
- You can also set the NOMYO_ROUTER_API_KEY environment variable to avoid storing the key in plain text.
api_keys
Type: dict[str, str]
Description: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints.
Example:
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
Environment Variables:
- API keys can reference environment variables using ${VAR_NAME} syntax
- The router will expand these references at startup
- Example: ${OPENAI_KEY} will be replaced with the value of the OPENAI_KEY environment variable
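The expansion step can be approximated with a regular expression. This is a sketch under assumptions: expand_env is a hypothetical name, and leaving unknown variables untouched is a choice made for this example, since the guide does not say how the router treats unset variables.

```python
import os
import re

# Matches ${VAR_NAME} where VAR_NAME is a conventional environment variable name
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env=os.environ) -> str:
    """Replace each ${VAR_NAME} with its value from env; unknown names stay as written."""
    return _ENV_REF.sub(lambda m: env.get(m.group(1), m.group(0)), value)
```

Literal keys such as "ollama" contain no reference and pass through unchanged.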
Environment Variables
NOMYO_ROUTER_CONFIG_PATH
Description: Path to the configuration file. If not set, defaults to config.yaml in the current working directory.
Example:
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
NOMYO_ROUTER_DB_PATH
Description: Path to the SQLite database file for storing token counts. If not set, defaults to token_counts.db in the current working directory.
Example:
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
NOMYO_ROUTER_API_KEY
Description: Router-level API key. When set, all router endpoints and the dashboard require this key via Authorization: Bearer <key> or the api_key query parameter.
Example:
export NOMYO_ROUTER_API_KEY=your_router_api_key
API-Specific Keys
You can set API keys as environment variables and reference them from api_keys using the ${VAR_NAME} syntax:
export OPENAI_KEY=your_openai_api_key
export ANTHROPIC_KEY=your_anthropic_api_key
Configuration Best Practices
Multiple Ollama Instances
For a cluster of Ollama instances:
endpoints:
- http://ollama-worker1:11434
- http://ollama-worker2:11434
- http://ollama-worker3:11434
max_concurrent_connections: 2
Recommendation: Set max_concurrent_connections to match your Ollama instances' OLLAMA_NUM_PARALLEL setting.
Mixed Endpoints
Combining Ollama and OpenAI endpoints:
endpoints:
- http://localhost:11434
- https://api.openai.com/v1
api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
Note: The router will automatically route requests based on model availability across all endpoints.
High Availability
For production deployments:
endpoints:
- http://ollama-primary:11434
- http://ollama-secondary:11434
- http://ollama-tertiary:11434
max_concurrent_connections: 3
Recommendation: Use multiple endpoints for redundancy and load distribution.
Priority Routing (Primary + Fallback)
When you have heterogeneous hardware and want to prefer a faster node:
endpoints:
- http://gpu-primary:11434 # high-VRAM node, listed first = highest priority
- http://gpu-secondary:11434
endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2
priority_routing: true
The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without priority_routing: true the default behaviour is random selection among idle endpoints.
Semantic LLM Cache
NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.
How it works
- On every cacheable request (/api/chat, /api/generate, /v1/chat/completions, /v1/completions) the cache is checked before choosing an endpoint.
- On a cache hit the stored response is returned immediately as a single chunk (streaming or non-streaming: both work).
- On a cache miss the request is forwarded normally. The response is stored in the cache after it completes.
- MOE requests (moe-* model prefix) always bypass the cache.
- Token counts are never recorded for cache hits.
Cache key strategy
| Signal | How matched |
|---|---|
| model + system_prompt | Exact: hard context isolation per deployment |
| BM25-weighted embedding of chat history | Semantic: conversation context signal |
| Embedding of last user message | Semantic: the actual question |
The two semantic vectors are combined as a weighted mean (tuned by cache_history_weight) before cosine similarity comparison, staying at a single 384-dimensional vector compatible with the library's storage format.
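The weighted mean and the similarity check can be illustrated in plain Python. This sketch makes no claim about the embedding library in use; combine and cosine_similarity are hypothetical helper names, and two-dimensional toy vectors stand in for the 384-dimensional embeddings.

```python
import math


def combine(history_vec, question_vec, history_weight=0.3):
    """Weighted mean of two equal-length vectors; the result has the same dimension."""
    w = history_weight
    return [w * h + (1.0 - w) * q for h, q in zip(history_vec, question_vec)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy example: history points along one axis, the question along the other
key_vec = combine([1.0, 0.0], [0.0, 1.0], history_weight=0.3)
```

A stored entry would count as a hit in this model when cosine_similarity(key_vec, stored_vec) is at least cache_similarity.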
Quick start — exact match (lean image)
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 1.0 # exact match only, no sentence-transformers needed
cache_ttl: 3600
Quick start — semantic matching (:semantic image)
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90 # hit if ≥90% cosine similarity
cache_ttl: 3600
cache_history_weight: 0.3
Pull the semantic image:
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
Cache configuration options
cache_enabled
Type: bool | Default: false
Enable or disable the cache. All other cache settings are ignored when false.
cache_backend
Type: str | Default: "memory"
| Value | Description | Persists | Multi-replica |
|---|---|---|---|
| memory | In-process LRU dict | ❌ | ❌ |
| sqlite | File-based via aiosqlite | ✅ | ❌ |
| redis | Redis via redis.asyncio | ✅ | ✅ |
Use redis when running multiple router replicas behind a load balancer — all replicas share one warm cache.
cache_similarity
Type: float | Default: 1.0
Cosine similarity threshold. 1.0 means exact match only (no embedding model needed). Values below 1.0 enable semantic matching, which requires the :semantic Docker image tag.
Recommended starting value for semantic mode: 0.90.
cache_ttl
Type: int | null | Default: 3600
Time-to-live for cache entries in seconds. Remove the key or set to null to cache forever.
cache_db_path
Type: str | Default: "llm_cache.db"
Path to the SQLite cache database. Only used when cache_backend: sqlite.
cache_redis_url
Type: str | Default: "redis://localhost:6379/0"
Redis connection URL. Only used when cache_backend: redis.
cache_history_weight
Type: float | Default: 0.3
Weight of the BM25-weighted chat-history embedding in the combined cache key vector. 0.3 means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when cache_similarity < 1.0.
Cache management endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/cache/stats | GET | Hit/miss counters, hit rate, current config |
| /api/cache/invalidate | POST | Clear all cache entries and reset counters |
# Check cache performance
curl http://localhost:12434/api/cache/stats
# Clear the cache
curl -X POST http://localhost:12434/api/cache/invalidate
Example stats response:
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
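The hit_rate value in this response is consistent with hits / (hits + misses) rounded to three decimals. The sketch below reproduces that arithmetic; both the formula and the rounding are inferred from the sample numbers, not confirmed by the router's source.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache, rounded to three decimals."""
    total = hits + misses
    return round(hits / total, 3) if total else 0.0


rate = hit_rate(1547, 892)  # 1547 / 2439
```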
Docker image variants
| Tag | Semantic cache | Image size |
|---|---|---|
| latest | ❌ exact match only | ~300 MB |
| latest-semantic | ✅ sentence-transformers + model pre-baked | ~800 MB |
Build locally:
# Lean (exact match)
docker build -t nomyo-router .
# Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
Configuration Validation
The router validates the configuration at startup:
- Endpoint URLs: Must be valid URLs
- API Keys: Must be strings (can reference environment variables)
- Connection Limits: Must be positive integers
If the configuration is invalid, the router will exit with an error message.
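To check a configuration before deploying it, the three rules above can be approximated in a few lines. This is a rough pre-flight sketch, not the router's validator: validate_config is a hypothetical function, and treating only http and https URLs with a host as valid is an assumption.

```python
from urllib.parse import urlparse


def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    errors = []
    for url in cfg.get("endpoints", []):
        parsed = urlparse(str(url))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"invalid endpoint URL: {url!r}")
    for url, key in cfg.get("api_keys", {}).items():
        if not isinstance(key, str):
            errors.append(f"API key for {url!r} must be a string")
    limit = cfg.get("max_concurrent_connections", 1)
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        errors.append("max_concurrent_connections must be a positive integer")
    return errors
```

The input is the parsed YAML as a dict, for example the result of loading config.yaml with a YAML parser of your choice.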
Dynamic Configuration
The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider:
- Using a configuration management system
- Implementing a rolling restart strategy
- Using environment variables for sensitive data
Example Configurations
See the examples directory for ready-to-use configuration examples.
Using the router API key
When the router API key is set (nomyo-router-api-key in config.yaml or the NOMYO_ROUTER_API_KEY environment variable), clients must send it on every request:
- Header (recommended): Authorization: Bearer <router_key>
- Query param (fallback): ?api_key=<router_key>
Example:
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
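The same header can be sent from Python using only the standard library. The helper name build_request is an assumption made for this sketch; the URL and port are taken from the curl example.

```python
import urllib.request


def build_request(url: str, router_key: str) -> urllib.request.Request:
    """Attach the router key as a Bearer token in the Authorization header."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {router_key}"}
    )


# Usage (requires a running router):
# req = build_request("http://localhost:12434/api/tags", "your_router_api_key")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Prefer the header over the api_key query parameter where possible, since query strings tend to end up in access logs.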