nomyo-router/doc/configuration.md

# Configuration Guide

## Configuration File

The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys.

### Basic Configuration

```yaml
# config.yaml
endpoints:
  - http://localhost:11434
  - http://ollama-server:11434

# Maximum concurrent connections *per endpoint‑model pair*
max_concurrent_connections: 2

# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""
```

### Complete Example

```yaml
# config.yaml
endpoints:
  - http://192.168.0.50:11434
  - http://192.168.0.51:11434
  - http://192.168.0.52:11434
  - https://api.openai.com/v1

# Maximum concurrent connections *per endpoint‑model pair* (equals to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2

# Per-endpoint overrides — any field not listed falls back to the global default (optional)
# endpoint_config:
#   "http://192.168.0.50:11434":
#     max_concurrent_connections: 4
#   "http://192.168.0.51:11434":
#     max_concurrent_connections: 1

# Priority / WRR routing (optional, default: false)
# priority_routing: true

# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""

# API keys for remote endpoints
# Set an environment variable like OPENAI_KEY
# Confirm endpoints are exactly as in endpoints block
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

## Configuration Options

### `endpoints`

**Type**: `list[str]`

**Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`).

**Examples**:
```yaml
endpoints:
  - http://localhost:11434
  - http://ollama1:11434
  - http://ollama2:11434
  - https://api.openai.com/v1
  - https://api.anthropic.com/v1
```

**Notes**:
- Ollama endpoints use the standard `/api/` prefix
- OpenAI-compatible endpoints use `/v1` prefix
- The router automatically detects endpoint type based on URL pattern

### `max_concurrent_connections`

**Type**: `int`

**Default**: `1`

**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting.

**Example**:
```yaml
max_concurrent_connections: 4
```

**Notes**:
- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage

### `endpoint_config`

**Type**: `dict[str, dict]` (optional)

**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)

**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.

**Supported per-endpoint fields**:

| Field | Description |
|---|---|
| `max_concurrent_connections` | Overrides the global limit for this endpoint only |

**Example**:
```yaml
endpoint_config:
  "http://192.168.0.50:11434":
    max_concurrent_connections: 4   # high-memory GPU node
  "http://192.168.0.51:11434":
    max_concurrent_connections: 1   # low-memory node
```

**Notes**:
- Useful when endpoints have different hardware capacity.
- The utilization ratio used by WRR (`priority_routing: true`) is computed per-endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, same as a node with limit 2 running 1 request.

---

### `priority_routing`

**Type**: `bool` (optional)

**Default**: `false`

**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.

| Value | Algorithm |
|---|---|
| `false` (default) | Random selection among equally-idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
| `true` | **Weighted Round Robin (WRR)** — endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |

**Example**:
```yaml
priority_routing: true
```

**When to use WRR**:
- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
- Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.

**Example — primary/fallback setup**:
```yaml
endpoints:
  - http://gpu-primary:11434    # preferred
  - http://gpu-secondary:11434  # fallback

endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2

priority_routing: true
```

With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.

---

### `conversation_affinity`

**Type**: `bool` (optional)

**Default**: `false`

**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)

**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).

#### How a conversation is identified

The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:

```
SHA1(  model
     + every leading message with role="system"
     + the first message with role="user"  )
```

Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.

**What this means in practice**

| You send… | Fingerprint behaves like… |
|---|---|
| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
| Turn 1 of a fresh chat | **New** fingerprint → new pin |
| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |

#### TTL and refresh

Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.

#### Why the dashboard may show more than one dot per visible conversation

The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:

- *Title generation* — synthetic system prompt + the user message as content
- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.

Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."

#### Example

```yaml
endpoints:
  - http://gpu-primary:11434
  - http://gpu-secondary:11434

conversation_affinity: true
conversation_affinity_ttl: 300
```

With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.

#### When to enable

- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
- ✅ Multi-endpoint deployments where models are loaded on more than one node.
- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.

---

### `conversation_affinity_ttl`

**Type**: `int` (seconds, optional)

**Default**: `300`

**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.

**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.

**Example**:
```yaml
conversation_affinity: true
conversation_affinity_ttl: 600   # half an hour of inactivity before un-pinning
```

---

### `router_api_key`

**Type**: `str` (optional)

**Description**: Shared secret that gates access to the NOMYO Router APIs and dashboard. When set, clients must send `Authorization: Bearer <key>` or an `api_key` query parameter.

**Example**:
```yaml
nomyo-router-api-key: "super-secret-value"
```

**Notes**:
- Leave this blank or omit it to disable router-level authentication.
- You can also set the `NOMYO_ROUTER_API_KEY` environment variable to avoid storing the key in plain text.

### `api_keys`

**Type**: `dict[str, str]`

**Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints.

**Example**:
```yaml
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

**Environment Variables**:
- API keys can reference environment variables using `${VAR_NAME}` syntax
- The router will expand these references at startup
- Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable

## Environment Variables

### `NOMYO_ROUTER_CONFIG_PATH`

**Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory.

**Example**:
```bash
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
```

### `NOMYO_ROUTER_DB_PATH`

**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory.

**Example**:
```bash
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
```

### `NOMYO_ROUTER_API_KEY`

**Description**: Router-level API key. When set, all router endpoints and the dashboard require this key via `Authorization: Bearer <key>` or the `api_key` query parameter.

**Example**:
```bash
export NOMYO_ROUTER_API_KEY=your_router_api_key
```

### API-Specific Keys

You can set API keys directly as environment variables:

```bash
export OPENAI_KEY=your_openai_api_key
export ANTHROPIC_KEY=your_anthropic_api_key
```

## Configuration Best Practices

### Multiple Ollama Instances

For a cluster of Ollama instances:

```yaml
endpoints:
  - http://ollama-worker1:11434
  - http://ollama-worker2:11434
  - http://ollama-worker3:11434

max_concurrent_connections: 2
```

**Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting.

### Mixed Endpoints

Combining Ollama and OpenAI endpoints:

```yaml
endpoints:
  - http://localhost:11434
  - https://api.openai.com/v1

api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

**Note**: The router will automatically route requests based on model availability across all endpoints.

### High Availability

For production deployments:

```yaml
endpoints:
  - http://ollama-primary:11434
  - http://ollama-secondary:11434
  - http://ollama-tertiary:11434

max_concurrent_connections: 3
```

**Recommendation**: Use multiple endpoints for redundancy and load distribution.

### Priority Routing (Primary + Fallback)

When you have heterogeneous hardware and want to prefer a faster node:

```yaml
endpoints:
  - http://gpu-primary:11434      # high-VRAM node, listed first = highest priority
  - http://gpu-secondary:11434

endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2

priority_routing: true
```

The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints.

## Semantic LLM Cache

NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.

### How it works

1. On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint.
2. On a **cache hit** the stored response is returned immediately as a single chunk (streaming or non-streaming — both work).
3. On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes.
4. **MOE requests** (`moe-*` model prefix) always bypass the cache.
5. **Token counts** are never recorded for cache hits.

### Cache key strategy

| Signal | How matched |
|---|---|
| `model + system_prompt` | Exact — hard context isolation per deployment |
| BM25-weighted embedding of chat history | Semantic — conversation context signal |
| Embedding of last user message | Semantic — the actual question |

The two semantic vectors are combined as a weighted mean (tuned by `cache_history_weight`) before cosine similarity comparison, staying at a single 384-dimensional vector compatible with the library's storage format.

### Quick start — exact match (lean image)

```yaml
cache_enabled: true
cache_backend: sqlite    # persists across restarts
cache_similarity: 1.0   # exact match only, no sentence-transformers needed
cache_ttl: 3600
```

### Quick start — semantic matching (:semantic image)

```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90   # hit if ≥90% cosine similarity
cache_ttl: 3600
cache_history_weight: 0.3
```

Pull the semantic image:
```bash
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
```

### Cache configuration options

#### `cache_enabled`

**Type**: `bool` | **Default**: `false`

Enable or disable the cache. All other cache settings are ignored when `false`.

#### `cache_backend`

**Type**: `str` | **Default**: `"memory"`

| Value | Description | Persists | Multi-replica |
|---|---|---|---|
| `memory` | In-process LRU dict | ❌ | ❌ |
| `sqlite` | File-based via `aiosqlite` | ✅ | ❌ |
| `redis` | Redis via `redis.asyncio` | ✅ | ✅ |

Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache.

#### `cache_similarity`

**Type**: `float` | **Default**: `1.0`

Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag.

Recommended starting value for semantic mode: `0.90`.

#### `cache_ttl`

**Type**: `int | null` | **Default**: `3600`

Time-to-live for cache entries in seconds. Remove the key or set to `null` to cache forever.

#### `cache_db_path`

**Type**: `str` | **Default**: `"llm_cache.db"`

Path to the SQLite cache database. Only used when `cache_backend: sqlite`.

#### `cache_redis_url`

**Type**: `str` | **Default**: `"redis://localhost:6379/0"`

Redis connection URL. Only used when `cache_backend: redis`.

#### `cache_history_weight`

**Type**: `float` | **Default**: `0.3`

Weight of the BM25-weighted chat-history embedding in the combined cache key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`.

### Cache management endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config |
| `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters |

```bash
# Check cache performance
curl http://localhost:12434/api/cache/stats

# Clear the cache
curl -X POST http://localhost:12434/api/cache/invalidate
```

Example stats response:
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```

### Docker image variants

| Tag | Semantic cache | Image size |
|---|---|---|
| `latest` | ❌ exact match only | ~300 MB |
| `latest-semantic` | ✅ sentence-transformers + model pre-baked | ~800 MB |

Build locally:
```bash
# Lean (exact match)
docker build -t nomyo-router .

# Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```

## Configuration Validation

The router validates the configuration at startup:

1. **Endpoint URLs**: Must be valid URLs
2. **API Keys**: Must be strings (can reference environment variables)
3. **Connection Limits**: Must be positive integers

If the configuration is invalid, the router will exit with an error message.

## Dynamic Configuration

The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider:

1. Using a configuration management system
2. Implementing a rolling restart strategy
3. Using environment variables for sensitive data

## Example Configurations

See the [examples](examples/) directory for ready-to-use configuration examples.


### Using the router API key

When `router_api_key`/`NOMYO_ROUTER_API_KEY` is set, clients must send it on every request:
- Header (recommended): Authorization: Bearer <router_key>
- Query param (fallback): ?api_key=<router_key>

Example:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
```
-												feat:
added buffer_lock to prevent race condition in high concurrency scenarios
added documentation

											
										
										
											2026-01-05 17:16:31 +01:00
+								# Configuration Guide
 								## Configuration File
 								The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys.
 								### Basic Configuration
 								```yaml
 								# config.yaml
 								endpoints:
 								  - http://localhost:11434
 								  - http://ollama-server:11434
 								# Maximum concurrent connections *per endpoint‑model pair*
 								max_concurrent_connections: 2
-												add: Optional router-level API key that gates router/API/web UI access

Optional router-level API key that gates router/API/web UI access (leave empty to disable)

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```

											
										
										
											2026-01-14 09:28:02 +01:00
 								# Optional router-level API key to secure the router and dashboard (leave blank to disable)
 								nomyo-router-api-key: ""
-												feat:
added buffer_lock to prevent race condition in high concurrency scenarios
added documentation

											
										
										
											2026-01-05 17:16:31 +01:00
+								```
 								### Complete Example
 								```yaml
 								# config.yaml
 								endpoints:
 								  - http://192.168.0.50:11434
 								  - http://192.168.0.51:11434
 								  - http://192.168.0.52:11434
 								  - https://api.openai.com/v1
 								# Maximum concurrent connections *per endpoint‑model pair* (equals to OLLAMA_NUM_PARALLEL)
 								max_concurrent_connections: 2
-												doc: primary routing and max_connections per endpoint added

											
										
										
											2026-05-01 13:55:29 +02:00
+								# Per-endpoint overrides — any field not listed falls back to the global default (optional)
 								# endpoint_config:
 								#   "http://192.168.0.50:11434":
 								#     max_concurrent_connections: 4
 								#   "http://192.168.0.51:11434":
 								#     max_concurrent_connections: 1
 								# Priority / WRR routing (optional, default: false)
 								# priority_routing: true
-												add: Optional router-level API key that gates router/API/web UI access

Optional router-level API key that gates router/API/web UI access (leave empty to disable)

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```

											
										
										
											2026-01-14 09:28:02 +01:00
+								# Optional router-level API key to secure the router and dashboard (leave blank to disable)
 								nomyo-router-api-key: ""
-												feat:
added buffer_lock to prevent race condition in high concurrency scenarios
added documentation

											
										
										
											2026-01-05 17:16:31 +01:00
+								# API keys for remote endpoints
 								# Set an environment variable like OPENAI_KEY
 								# Confirm endpoints are exactly as in endpoints block
 								api_keys:
 								  "http://192.168.0.50:11434": "ollama"
 								  "http://192.168.0.51:11434": "ollama"
 								  "http://192.168.0.52:11434": "ollama"
 								  "https://api.openai.com/v1": "${OPENAI_KEY}"
 								```
 								## Configuration Options
 								### `endpoints`
 								**Type**: `list[str]`
 								**Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`).
 								**Examples**:
 								```yaml
 								endpoints:
 								  - http://localhost:11434
 								  - http://ollama1:11434
 								  - http://ollama2:11434
 								  - https://api.openai.com/v1
 								  - https://api.anthropic.com/v1
 								```
 								**Notes**:
 								- Ollama endpoints use the standard `/api/` prefix
 								- OpenAI-compatible endpoints use `/v1` prefix
 								- The router automatically detects endpoint type based on URL pattern
 								### `max_concurrent_connections`
 								**Type**: `int`
 								**Default**: `1`
 								**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting.
 								**Example**:
 								```yaml
 								max_concurrent_connections: 4
 								```
 								**Notes**:
 								- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
 								- When this limit is reached, the router will route requests to other endpoints with available capacity
 								- Higher values allow more parallel requests but may increase memory usage
-												doc: primary routing and max_connections per endpoint added

											
										
										
											2026-05-01 13:55:29 +02:00
+								### `endpoint_config`
 								**Type**: `dict[str, dict]` (optional)
 								**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)
 								**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.
 								**Supported per-endpoint fields**:
 								| Field | Description |
 								|---|---|
 								| `max_concurrent_connections` | Overrides the global limit for this endpoint only |
 								**Example**:
 								```yaml
 								endpoint_config:
 								  "http://192.168.0.50:11434":
 								    max_concurrent_connections: 4   # high-memory GPU node
 								  "http://192.168.0.51:11434":
 								    max_concurrent_connections: 1   # low-memory node
 								```
 								**Notes**:
 								- Useful when endpoints have different hardware capacity.
 								- The utilization ratio used by WRR (`priority_routing: true`) is computed per-endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, same as a node with limit 2 running 1 request.
 								---
 								### `priority_routing`
 								**Type**: `bool` (optional)
 								**Default**: `false`
 								**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.
 								| Value | Algorithm |
 								|---|---|
 								| `false` (default) | Random selection among equally-idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
 								| `true` | **Weighted Round Robin (WRR)** — endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |
 								**Example**:
 								```yaml
 								priority_routing: true
 								```
 								**When to use WRR**:
 								- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
 								- Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.
 								**Example — primary/fallback setup**:
 								```yaml
 								endpoints:
 								  - http://gpu-primary:11434    # preferred
 								  - http://gpu-secondary:11434  # fallback
 								endpoint_config:
 								  "http://gpu-primary:11434":
 								    max_concurrent_connections: 4
 								  "http://gpu-secondary:11434":
 								    max_concurrent_connections: 2
 								priority_routing: true
 								```
 								With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.
 								---
-												feat: visualization of conversation affinity in dashboard

											
										
										
											2026-05-13 13:38:37 +02:00
+								### `conversation_affinity`
 								**Type**: `bool` (optional)
 								**Default**: `false`
 								**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)
 								**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
 								#### How a conversation is identified
 								The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
 								```
 								SHA1(  model
 								     + every leading message with role="system"
 								     + the first message with role="user"  )
 								```
 								Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
 								**What this means in practice**
 								| You send… | Fingerprint behaves like… |
 								|---|---|
 								| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
 								| Turn 1 of a fresh chat | **New** fingerprint → new pin |
 								| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
 								| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |
 								#### TTL and refresh
 								Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
 								#### Why the dashboard may show more than one dot per visible conversation
 								The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
 								- *Title generation* — synthetic system prompt + the user message as content
 								- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
 								- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
 								Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
 								#### Example
 								```yaml
 								endpoints:
 								  - http://gpu-primary:11434
 								  - http://gpu-secondary:11434
 								conversation_affinity: true
 								conversation_affinity_ttl: 300
 								```
 								With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
 								#### When to enable
 								- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
 								- ✅ Multi-endpoint deployments where models are loaded on more than one node.
 								- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
 								- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
 								---
 								### `conversation_affinity_ttl`
 								**Type**: `int` (seconds, optional)
 								**Default**: `300`
 								**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
 								**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
 								**Example**:
 								```yaml
 								conversation_affinity: true
 								conversation_affinity_ttl: 600   # half an hour of inactivity before un-pinning
 								```
 								---
-												add: Optional router-level API key that gates router/API/web UI access

Optional router-level API key that gates router/API/web UI access (leave empty to disable)

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```

											
										
										
											2026-01-14 09:28:02 +01:00
+								### `router_api_key`
 								**Type**: `str` (optional)
 								**Description**: Shared secret that gates access to the NOMYO Router APIs and dashboard. When set, clients must send `Authorization: Bearer <key>` or an `api_key` query parameter.
 								**Example**:
 								```yaml
 								nomyo-router-api-key: "super-secret-value"
 								```
 								**Notes**:
 								- Leave this blank or omit it to disable router-level authentication.
 								- You can also set the `NOMYO_ROUTER_API_KEY` environment variable to avoid storing the key in plain text.
-												feat:
added buffer_lock to prevent race condition in high concurrency scenarios
added documentation

											
										
										
											2026-01-05 17:16:31 +01:00
+								### `api_keys`
 								**Type**: `dict[str, str]`
 								**Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints.
 								**Example**:
 								```yaml
 								api_keys:
 								  "http://192.168.0.50:11434": "ollama"
 								  "https://api.openai.com/v1": "${OPENAI_KEY}"
 								```
 								**Environment Variables**:
 								- API keys can reference environment variables using `${VAR_NAME}` syntax
 								- The router will expand these references at startup
 								- Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable
 								## Environment Variables
 								### `NOMYO_ROUTER_CONFIG_PATH`
 								**Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory.
 								**Example**:
 								```bash
 								export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
 								```
 								### `NOMYO_ROUTER_DB_PATH`
 								**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory.
 								**Example**:
 								```bash
 								export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
 								```
-												add: Optional router-level API key that gates router/API/web UI access

Optional router-level API key that gates router/API/web UI access (leave empty to disable)

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```

											
										
										
											2026-01-14 09:28:02 +01:00
+								### `NOMYO_ROUTER_API_KEY`
 								**Description**: Router-level API key. When set, all router endpoints and the dashboard require this key via `Authorization: Bearer <key>` or the `api_key` query parameter.
 								**Example**:
 								```bash
 								export NOMYO_ROUTER_API_KEY=your_router_api_key
 								```
-												feat:
added buffer_lock to prevent race condition in high concurrency scenarios
added documentation

											
										
										
											2026-01-05 17:16:31 +01:00
+								### API-Specific Keys
 								You can set API keys directly as environment variables:
 								```bash
 								export OPENAI_KEY=your_openai_api_key
 								export ANTHROPIC_KEY=your_anthropic_api_key
 								```
 								## Configuration Best Practices
 								### Multiple Ollama Instances
 								For a cluster of Ollama instances:
 								```yaml
 								endpoints:
 								  - http://ollama-worker1:11434
 								  - http://ollama-worker2:11434
 								  - http://ollama-worker3:11434
 								max_concurrent_connections: 2
 								```
 								**Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting.
 								### Mixed Endpoints
 								Combining Ollama and OpenAI endpoints:
 								```yaml
 								endpoints:
 								  - http://localhost:11434
 								  - https://api.openai.com/v1
 								api_keys:
 								  "https://api.openai.com/v1": "${OPENAI_KEY}"
 								```
 								**Note**: The router will automatically route requests based on model availability across all endpoints.
 								### High Availability
 								For production deployments:
 								```yaml
 								endpoints:
 								  - http://ollama-primary:11434
 								  - http://ollama-secondary:11434
 								  - http://ollama-tertiary:11434
 								max_concurrent_connections: 3
 								```
 								**Recommendation**: Use multiple endpoints for redundancy and load distribution.
-												doc: primary routing and max_connections per endpoint added

											
										
										
											2026-05-01 13:55:29 +02:00
+								### Priority Routing (Primary + Fallback)
 								When you have heterogeneous hardware and want to prefer a faster node:
 								```yaml
 								endpoints:
 								  - http://gpu-primary:11434      # high-VRAM node, listed first = highest priority
 								  - http://gpu-secondary:11434
 								endpoint_config:
 								  "http://gpu-primary:11434":
 								    max_concurrent_connections: 4
 								  "http://gpu-secondary:11434":
 								    max_concurrent_connections: 2
 								priority_routing: true
 								```
 								The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints.
-												feat: adding a semantic cache layer

											
										
										
											2026-03-08 09:12:09 +01:00
+								## Semantic LLM Cache
 								NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.
 								### How it works
 . On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint.
 . On a **cache hit** the stored response is returned immediately as a single chunk (streaming or non-streaming — both work).
 . On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes.
 . **MOE requests** (`moe-*` model prefix) always bypass the cache.
 . **Token counts** are never recorded for cache hits.
 								### Cache key strategy
 								| Signal | How matched |
 								|---|---|
 								| `model + system_prompt` | Exact — hard context isolation per deployment |
 								| BM25-weighted embedding of chat history | Semantic — conversation context signal |
 								| Embedding of last user message | Semantic — the actual question |
 								The two semantic vectors are combined as a weighted mean (tuned by `cache_history_weight`) before cosine similarity comparison, staying at a single 384-dimensional vector compatible with the library's storage format.
 								### Quick start — exact match (lean image)
 								```yaml
 								cache_enabled: true
 								cache_backend: sqlite    # persists across restarts
 								cache_similarity: 1.0   # exact match only, no sentence-transformers needed
 								cache_ttl: 3600
 								```
 								### Quick start — semantic matching (:semantic image)
 								```yaml
 								cache_enabled: true
 								cache_backend: sqlite
 								cache_similarity: 0.90   # hit if ≥90% cosine similarity
 								cache_ttl: 3600
 								cache_history_weight: 0.3
 								```
 								Pull the semantic image:
 								```bash
 								docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
 								```
 								### Cache configuration options
 								#### `cache_enabled`
 								**Type**: `bool` | **Default**: `false`
 								Enable or disable the cache. All other cache settings are ignored when `false`.
 								#### `cache_backend`
 								**Type**: `str` | **Default**: `"memory"`
 								| Value | Description | Persists | Multi-replica |
 								|---|---|---|---|
 								| `memory` | In-process LRU dict | ❌ | ❌ |
 								| `sqlite` | File-based via `aiosqlite` | ✅ | ❌ |
 								| `redis` | Redis via `redis.asyncio` | ✅ | ✅ |
 								Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache.
 								#### `cache_similarity`
 								**Type**: `float` | **Default**: `1.0`
 								Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag.
 								Recommended starting value for semantic mode: `0.90`.
 								#### `cache_ttl`
 								**Type**: `int | null` | **Default**: `3600`
 								Time-to-live for cache entries in seconds. Remove the key or set to `null` to cache forever.
 								#### `cache_db_path`
 								**Type**: `str` | **Default**: `"llm_cache.db"`
 								Path to the SQLite cache database. Only used when `cache_backend: sqlite`.
 								#### `cache_redis_url`
 								**Type**: `str` | **Default**: `"redis://localhost:6379/0"`
 								Redis connection URL. Only used when `cache_backend: redis`.
 								#### `cache_history_weight`
 								**Type**: `float` | **Default**: `0.3`
 								Weight of the BM25-weighted chat-history embedding in the combined cache key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`.
 								### Cache management endpoints
 								| Endpoint | Method | Description |
 								|---|---|---|
 								| `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config |
 								| `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters |
 								```bash
 								# Check cache performance
 								curl http://localhost:12434/api/cache/stats
 								# Clear the cache
 								curl -X POST http://localhost:12434/api/cache/invalidate
 								```
 								Example stats response:
 								```json
 								{
 								  "enabled": true,
 								  "hits": 1547,
 								  "misses": 892,
 								  "hit_rate": 0.634,
 								  "semantic": true,
 								  "backend": "sqlite",
 								  "similarity_threshold": 0.9,
 								  "history_weight": 0.3
 								}
 								```
 								### Docker image variants
 								| Tag | Semantic cache | Image size |
 								|---|---|---|
 								| `latest` | ❌ exact match only | ~300 MB |
 								| `latest-semantic` | ✅ sentence-transformers + model pre-baked | ~800 MB |
 								Build locally:
 								```bash
 								# Lean (exact match)
 								docker build -t nomyo-router .
 								# Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in)
 								docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
 								```
-												feat:
added buffer_lock to prevent race condition in high concurrency scenarios
added documentation

											
										
										
											2026-01-05 17:16:31 +01:00
+								## Configuration Validation
 								The router validates the configuration at startup:
 . **Endpoint URLs**: Must be valid URLs
 . **API Keys**: Must be strings (can reference environment variables)
 . **Connection Limits**: Must be positive integers
 								If the configuration is invalid, the router will exit with an error message.
 								## Dynamic Configuration
 								The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider:
 . Using a configuration management system
 . Implementing a rolling restart strategy
 . Using environment variables for sensitive data
 								## Example Configurations
 								See the [examples](examples/) directory for ready-to-use configuration examples.
-												add: Optional router-level API key that gates router/API/web UI access

Optional router-level API key that gates router/API/web UI access (leave empty to disable)

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```

											
										
										
											2026-01-14 09:28:02 +01:00
 								### Using the router API key
 								When `router_api_key`/`NOMYO_ROUTER_API_KEY` is set, clients must send it on every request:
 								- Header (recommended): Authorization: Bearer <router_key>
 								- Query param (fallback): ?api_key=<router_key>
 								Example:
 								```bash
 								curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
 								```