# NOMYO Router
NOMYO Router is a transparent proxy for [Ollama](https://github.com/ollama/ollama) with model-deployment-aware routing.
[![Click for video](https://github.com/user-attachments/assets/ddacdf88-e3f3-41dd-8be6-f165b22d9879)](https://eu1.nomyo.ai/assets/dash.mp4)
It runs between your frontend application and your Ollama backends and is transparent to both.
![arch](https://github.com/user-attachments/assets/1e0064ab-de54-4226-8a15-c0fcca64704c)
# Installation
Copy or clone the repository, then edit `config.yaml`: add your Ollama backend servers and set `max_concurrent_connections` per endpoint. This value should match your `OLLAMA_NUM_PARALLEL` setting.
```yaml
# config.yaml
# Ollama or OpenAI API v1 endpoints
endpoints:
  - http://ollama0:11434
  - http://ollama1:11434
  - http://ollama2:11434
  - https://api.openai.com/v1

# llama.cpp server endpoints
llama_server_endpoints:
  - http://192.168.0.33:8889/v1

# Maximum concurrent connections *per endpoint-model pair*
max_concurrent_connections: 2

# Optional router-level API key to lock down router + dashboard (leave empty to disable)
nomyo-router-api-key: ""

# API keys for remote endpoints
# Set an environment variable like OPENAI_KEY
# Confirm endpoints are listed exactly as in the endpoints block
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
  "http://192.168.0.33:8889/v1": "llama"
```
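The `${OPENAI_KEY}` entry above suggests environment-variable substitution at config load time. As a minimal sketch of how such `${VAR}` expansion can be done with the standard library (the router's actual mechanism may differ; `expand_env` is an illustrative name):

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment-variable values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["OPENAI_KEY"] = "sk-demo"    # stand-in value for illustration
print(expand_env("${OPENAI_KEY}"))      # prints: sk-demo
```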
2025-08-26 18:40:23 +02:00
Run NOMYO Router in a dedicated virtual environment. Create one, activate it, and install the requirements:
```sh
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
[optional] On the shell, export the API keys referenced in `config.yaml`:
```sh
export OPENAI_KEY=YOUR_SECRET_API_KEY
# Optional: router-level key (clients must send Authorization: Bearer)
# export NOMYO_ROUTER_API_KEY=YOUR_ROUTER_KEY
```
Finally, start the router:
```sh
uvicorn router:app --host 127.0.0.1 --port 12434
```
In <u>very</u> high-concurrency scenarios (> 500 simultaneous requests) you can also run with uvloop:
```sh
uvicorn router:app --host 127.0.0.1 --port 12434 --loop uvloop
```
## Docker Deployment
### Pre-built image (GitHub Container Registry)
Pre-built multi-arch images (`linux/amd64`, `linux/arm64`) are published automatically on every release.
**Lean image** (exact-match cache, ~300 MB):
```sh
docker pull ghcr.io/nomyo-ai/nomyo-router:latest
docker pull ghcr.io/nomyo-ai/nomyo-router:0.7.0
```
**Semantic image** (semantic cache with `all-MiniLM-L6-v2` pre-baked, ~800 MB):
```sh
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
docker pull ghcr.io/nomyo-ai/nomyo-router:0.7.0-semantic
```
### Build the container image locally
```sh
# Lean build (exact-match cache, default)
docker build -t nomyo-router .

# Semantic build — sentence-transformers + model baked in
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```
Run the router in Docker with your own configuration file mounted from the host. The entrypoint script accepts a `--config-path` argument so you can point to a file anywhere inside the container:
```sh
docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /absolute/path/to/config_folder:/app/config/ \
  -e CONFIG_PATH=/app/config/config.yaml \
  nomyo-router
```
Notes:
- `-e CONFIG_PATH` sets the `NOMYO_ROUTER_CONFIG_PATH` environment variable under the hood; you can export it directly instead if you prefer.
- To override the bind address or port, export `UVICORN_HOST` or `UVICORN_PORT`, or pass the corresponding uvicorn flags after `--`, e.g. `nomyo-router --config-path /config/config.yaml -- --port 9000`.
- Use `docker logs nomyo-router` to confirm the loaded endpoints and concurrency settings at startup.
# Routing
NOMYO Router accepts any Ollama API request from your frontend application on the configured port and checks the available backends for that specific request.
When the request is embed(dings), chat, or generate, it is forwarded to a single Ollama server, answered there, and sent back through the router to the frontend.

If another request for the same model configuration arrives, NOMYO Router knows which model runs on which Ollama server and routes the request to a server where that model is already deployed.

If at the same time the configured maximum of concurrent connections is exceeded, NOMYO Router routes the request to another Ollama server that serves the requested model and has the fewest active connections, for fastest completion.

This way the Ollama backend servers are utilized more efficiently than with a simple weighted, round-robin, or least-connections approach.
![routing](https://github.com/user-attachments/assets/ed05dfbb-fcc8-4ff2-b8ca-3cdce2660c9f)
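The selection rule described above can be sketched as follows (a simplified illustration, not the router's actual code; the field names are hypothetical):

```python
def pick_endpoint(endpoints, model, max_concurrent):
    """Among endpoints that already serve `model` and are under the
    per-endpoint-model connection cap, pick the least-loaded one."""
    candidates = [
        e for e in endpoints
        if model in e["deployed_models"] and e["active"] < max_concurrent
    ]
    if not candidates:
        return None  # the router would then deploy the model on another backend
    return min(candidates, key=lambda e: e["active"])

servers = [
    {"url": "http://ollama0:11434", "deployed_models": {"llama3"}, "active": 2},
    {"url": "http://ollama1:11434", "deployed_models": {"llama3"}, "active": 0},
    {"url": "http://ollama2:11434", "deployed_models": set(), "active": 0},
]
# ollama0 is at the cap (2), ollama2 has no copy of the model → ollama1 wins
print(pick_endpoint(servers, "llama3", max_concurrent=2)["url"])  # http://ollama1:11434
```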
NOMYO Router also supports OpenAI API compatible v1 backend servers.
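Because the router is transparent, clients simply target the router's port instead of a backend. A minimal standard-library sketch of building such a request (host, port, and model name are illustrative assumptions):

```python
import json
import urllib.request

# Router address is an assumption; adjust to your deployment.
ROUTER_URL = "http://127.0.0.1:12434/v1/chat/completions"

payload = {
    "model": "llama3.2",  # hypothetical model name
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    ROUTER_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the router running, urllib.request.urlopen(req) would return the completion.
```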
## Semantic LLM Cache
NOMYO Router includes an optional semantic cache that serves repeated or semantically similar LLM requests from cache — no endpoint round-trip, no token cost, response in <10 ms.
### Enable (exact match, any image)
```yaml
# config.yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 1.0 # exact match only
cache_ttl: 3600
```
### Enable (semantic matching, :semantic image)
```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90 # "What is Python?" ≈ "What's Python?" → cache hit
cache_ttl: 3600
cache_history_weight: 0.3
```
Pull the semantic image:
```bash
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
```
### Cache key strategy
Each request is keyed on `model + system_prompt` (exact) combined with a weighted-mean embedding of BM25-weighted chat history (30%) and the last user message (70%). This means:
- Different system prompts → always separate cache namespaces (no cross-tenant leakage)
- Same question, different phrasing → cache hit (semantic mode)
- MOE requests (`moe-*`) → always bypass the cache
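A hypothetical sketch of how such a key could be composed (the router's actual implementation may differ; the function names are illustrative):

```python
import hashlib

def cache_namespace(model: str, system_prompt: str) -> str:
    # Exact-match part of the key: different system prompts can never
    # collide, so there is no cross-tenant leakage.
    return hashlib.sha256(f"{model}\x00{system_prompt}".encode()).hexdigest()

def combined_embedding(history_vec, last_msg_vec, history_weight=0.3):
    # Weighted mean mirroring cache_history_weight: 30 % chat history,
    # 70 % last user message.
    w = history_weight
    return [w * h + (1 - w) * m for h, m in zip(history_vec, last_msg_vec)]
```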
### Cached routes
`/api/chat` · `/api/generate` · `/v1/chat/completions` · `/v1/completions`
### Cache management
```bash
curl http://localhost:12434/api/cache/stats # hit rate, counters, config
curl -X POST http://localhost:12434/api/cache/invalidate # clear all entries
```
## Supplying the router API key
If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:
- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`
Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```