# Configuration Guide ## Configuration File The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys. ### Basic Configuration ```yaml # config.yaml endpoints: - http://localhost:11434 - http://ollama-server:11434 # Maximum concurrent connections *per endpoint‑model pair* max_concurrent_connections: 2 # Optional router-level API key to secure the router and dashboard (leave blank to disable) nomyo-router-api-key: "" ``` ### Complete Example ```yaml # config.yaml endpoints: - http://192.168.0.50:11434 - http://192.168.0.51:11434 - http://192.168.0.52:11434 - https://api.openai.com/v1 # Maximum concurrent connections *per endpoint‑model pair* (equals to OLLAMA_NUM_PARALLEL) max_concurrent_connections: 2 # Per-endpoint overrides — any field not listed falls back to the global default (optional) # endpoint_config: # "http://192.168.0.50:11434": # max_concurrent_connections: 4 # "http://192.168.0.51:11434": # max_concurrent_connections: 1 # Priority / WRR routing (optional, default: false) # priority_routing: true # Optional router-level API key to secure the router and dashboard (leave blank to disable) nomyo-router-api-key: "" # API keys for remote endpoints # Set an environment variable like OPENAI_KEY # Confirm endpoints are exactly as in endpoints block api_keys: "http://192.168.0.50:11434": "ollama" "http://192.168.0.51:11434": "ollama" "http://192.168.0.52:11434": "ollama" "https://api.openai.com/v1": "${OPENAI_KEY}" ``` ## Configuration Options ### `endpoints` **Type**: `list[str]` **Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`). **Examples**: ```yaml endpoints: - http://localhost:11434 - http://ollama1:11434 - http://ollama2:11434 - https://api.openai.com/v1 - https://api.anthropic.com/v1 ``` **Notes**: - Ollama endpoints use the standard `/api/` prefix - OpenAI-compatible endpoints use `/v1` prefix - The router automatically detects endpoint type based on URL pattern ### `max_concurrent_connections` **Type**: `int` **Default**: `1` **Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting. **Example**: ```yaml max_concurrent_connections: 4 ``` **Notes**: - This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint - When this limit is reached, the router will route requests to other endpoints with available capacity - Higher values allow more parallel requests but may increase memory usage ### `endpoint_config` **Type**: `dict[str, dict]` (optional) **Default**: `{}` (all endpoints use the global `max_concurrent_connections`) **Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default. **Supported per-endpoint fields**: | Field | Description | |---|---| | `max_concurrent_connections` | Overrides the global limit for this endpoint only | **Example**: ```yaml endpoint_config: "http://192.168.0.50:11434": max_concurrent_connections: 4 # high-memory GPU node "http://192.168.0.51:11434": max_concurrent_connections: 1 # low-memory node ``` **Notes**: - Useful when endpoints have different hardware capacity. - The utilization ratio used by WRR (`priority_routing: true`) is computed per-endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, same as a node with limit 2 running 1 request. --- ### `priority_routing` **Type**: `bool` (optional) **Default**: `false` **Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request. | Value | Algorithm | |---|---| | `false` (default) | Random selection among equally-idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. | | `true` | **Weighted Round Robin (WRR)** — endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. | **Example**: ```yaml priority_routing: true ``` **When to use WRR**: - You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated. - Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts. **Example — primary/fallback setup**: ```yaml endpoints: - http://gpu-primary:11434 # preferred - http://gpu-secondary:11434 # fallback endpoint_config: "http://gpu-primary:11434": max_concurrent_connections: 4 "http://gpu-secondary:11434": max_concurrent_connections: 2 priority_routing: true ``` With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic. --- ### `conversation_affinity` **Type**: `bool` (optional) **Default**: `false` **Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl) **Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random). #### How a conversation is identified The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from: ``` SHA1( model + every leading message with role="system" + the first message with role="user" ) ``` Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity. **What this means in practice** | You send… | Fingerprint behaves like… | |---|---| | Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed | | Turn 1 of a fresh chat | **New** fingerprint → new pin | | Same first user prompt but a different model | **New** fingerprint (model is part of the hash) | | Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick | #### TTL and refresh Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway. #### Why the dashboard may show more than one dot per visible conversation The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation: - *Title generation* — synthetic system prompt + the user message as content - *Follow-up question suggestion* — synthetic system prompt + the conversation as content - *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc. Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation." #### Example ```yaml endpoints: - http://gpu-primary:11434 - http://gpu-secondary:11434 conversation_affinity: true conversation_affinity_ttl: 300 ``` With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn. #### When to enable - ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial. - ✅ Multi-endpoint deployments where models are loaded on more than one node. - ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm). - ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance. --- ### `conversation_affinity_ttl` **Type**: `int` (seconds, optional) **Default**: `300` **Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence. **Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin. **Example**: ```yaml conversation_affinity: true conversation_affinity_ttl: 600 # half an hour of inactivity before un-pinning ``` --- ### `router_api_key` **Type**: `str` (optional) **Description**: Shared secret that gates access to the NOMYO Router APIs and dashboard. When set, clients must send `Authorization: Bearer ` or an `api_key` query parameter. **Example**: ```yaml nomyo-router-api-key: "super-secret-value" ``` **Notes**: - Leave this blank or omit it to disable router-level authentication. - You can also set the `NOMYO_ROUTER_API_KEY` environment variable to avoid storing the key in plain text. ### `api_keys` **Type**: `dict[str, str]` **Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints. **Example**: ```yaml api_keys: "http://192.168.0.50:11434": "ollama" "https://api.openai.com/v1": "${OPENAI_KEY}" ``` **Environment Variables**: - API keys can reference environment variables using `${VAR_NAME}` syntax - The router will expand these references at startup - Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable ## Environment Variables ### `NOMYO_ROUTER_CONFIG_PATH` **Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory. **Example**: ```bash export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml ``` ### `NOMYO_ROUTER_DB_PATH` **Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory. **Example**: ```bash export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db ``` ### `NOMYO_ROUTER_API_KEY` **Description**: Router-level API key. When set, all router endpoints and the dashboard require this key via `Authorization: Bearer ` or the `api_key` query parameter. **Example**: ```bash export NOMYO_ROUTER_API_KEY=your_router_api_key ``` ### API-Specific Keys You can set API keys directly as environment variables: ```bash export OPENAI_KEY=your_openai_api_key export ANTHROPIC_KEY=your_anthropic_api_key ``` ## Configuration Best Practices ### Multiple Ollama Instances For a cluster of Ollama instances: ```yaml endpoints: - http://ollama-worker1:11434 - http://ollama-worker2:11434 - http://ollama-worker3:11434 max_concurrent_connections: 2 ``` **Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting. ### Mixed Endpoints Combining Ollama and OpenAI endpoints: ```yaml endpoints: - http://localhost:11434 - https://api.openai.com/v1 api_keys: "https://api.openai.com/v1": "${OPENAI_KEY}" ``` **Note**: The router will automatically route requests based on model availability across all endpoints. ### High Availability For production deployments: ```yaml endpoints: - http://ollama-primary:11434 - http://ollama-secondary:11434 - http://ollama-tertiary:11434 max_concurrent_connections: 3 ``` **Recommendation**: Use multiple endpoints for redundancy and load distribution. ### Priority Routing (Primary + Fallback) When you have heterogeneous hardware and want to prefer a faster node: ```yaml endpoints: - http://gpu-primary:11434 # high-VRAM node, listed first = highest priority - http://gpu-secondary:11434 endpoint_config: "http://gpu-primary:11434": max_concurrent_connections: 4 "http://gpu-secondary:11434": max_concurrent_connections: 2 priority_routing: true ``` The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints. ## Semantic LLM Cache NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely. ### How it works 1. On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint. 2. On a **cache hit** the stored response is returned immediately as a single chunk (streaming or non-streaming — both work). 3. On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes. 4. **MOE requests** (`moe-*` model prefix) always bypass the cache. 5. **Token counts** are never recorded for cache hits. ### Cache key strategy | Signal | How matched | |---|---| | `model + system_prompt` | Exact — hard context isolation per deployment | | BM25-weighted embedding of chat history | Semantic — conversation context signal | | Embedding of last user message | Semantic — the actual question | The two semantic vectors are combined as a weighted mean (tuned by `cache_history_weight`) before cosine similarity comparison, staying at a single 384-dimensional vector compatible with the library's storage format. ### Quick start — exact match (lean image) ```yaml cache_enabled: true cache_backend: sqlite # persists across restarts cache_similarity: 1.0 # exact match only, no sentence-transformers needed cache_ttl: 3600 ``` ### Quick start — semantic matching (:semantic image) ```yaml cache_enabled: true cache_backend: sqlite cache_similarity: 0.90 # hit if ≥90% cosine similarity cache_ttl: 3600 cache_history_weight: 0.3 ``` Pull the semantic image: ```bash docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic ``` ### Cache configuration options #### `cache_enabled` **Type**: `bool` | **Default**: `false` Enable or disable the cache. All other cache settings are ignored when `false`. #### `cache_backend` **Type**: `str` | **Default**: `"memory"` | Value | Description | Persists | Multi-replica | |---|---|---|---| | `memory` | In-process LRU dict | ❌ | ❌ | | `sqlite` | File-based via `aiosqlite` | ✅ | ❌ | | `redis` | Redis via `redis.asyncio` | ✅ | ✅ | Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache. #### `cache_similarity` **Type**: `float` | **Default**: `1.0` Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag. Recommended starting value for semantic mode: `0.90`. #### `cache_ttl` **Type**: `int | null` | **Default**: `3600` Time-to-live for cache entries in seconds. Remove the key or set to `null` to cache forever. #### `cache_db_path` **Type**: `str` | **Default**: `"llm_cache.db"` Path to the SQLite cache database. Only used when `cache_backend: sqlite`. #### `cache_redis_url` **Type**: `str` | **Default**: `"redis://localhost:6379/0"` Redis connection URL. Only used when `cache_backend: redis`. #### `cache_history_weight` **Type**: `float` | **Default**: `0.3` Weight of the BM25-weighted chat-history embedding in the combined cache key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`. ### Cache management endpoints | Endpoint | Method | Description | |---|---|---| | `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config | | `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters | ```bash # Check cache performance curl http://localhost:12434/api/cache/stats # Clear the cache curl -X POST http://localhost:12434/api/cache/invalidate ``` Example stats response: ```json { "enabled": true, "hits": 1547, "misses": 892, "hit_rate": 0.634, "semantic": true, "backend": "sqlite", "similarity_threshold": 0.9, "history_weight": 0.3 } ``` ### Docker image variants | Tag | Semantic cache | Image size | |---|---|---| | `latest` | ❌ exact match only | ~300 MB | | `latest-semantic` | ✅ sentence-transformers + model pre-baked | ~800 MB | Build locally: ```bash # Lean (exact match) docker build -t nomyo-router . # Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in) docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic . ``` ## Configuration Validation The router validates the configuration at startup: 1. **Endpoint URLs**: Must be valid URLs 2. **API Keys**: Must be strings (can reference environment variables) 3. **Connection Limits**: Must be positive integers If the configuration is invalid, the router will exit with an error message. ## Dynamic Configuration The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider: 1. Using a configuration management system 2. Implementing a rolling restart strategy 3. Using environment variables for sensitive data ## Example Configurations See the [examples](examples/) directory for ready-to-use configuration examples. ### Using the router API key When `router_api_key`/`NOMYO_ROUTER_API_KEY` is set, clients must send it on every request: - Header (recommended): Authorization: Bearer - Query param (fallback): ?api_key= Example: ```bash curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags ```