# NOMYO Router

NOMYO Router is a transparent proxy for inference engines such as [Ollama](https://github.com/ollama/ollama), [llama.cpp](https://github.com/ggml-org/llama.cpp/), [vLLM](https://github.com/vllm-project/vllm), or any OpenAI V1 compatible endpoint, with model-deployment-aware routing.

[Watch the video](https://eu1.nomyo.ai/assets/dash.mp4)

It runs between your frontend application and your Ollama backend and is transparent to both.

# Installation

Copy or clone the repository, then edit `config.yaml`: add your Ollama backend servers and set the `max_concurrent_connections` limit for each endpoint. This value should match your `OLLAMA_NUM_PARALLEL` setting.

```
# config.yaml
# Ollama or OpenAI API V1 endpoints
endpoints:
  - http://ollama0:11434
  - http://ollama1:11434
  - http://ollama2:11434
  - https://api.openai.com/v1

# llama.cpp server endpoints
llama_server_endpoints:
  - http://192.168.0.33:8889/v1

# Maximum concurrent connections *per endpoint-model pair*
max_concurrent_connections: 2

# Optional router-level API key to lock down router + dashboard (leave empty to disable)
nomyo-router-api-key: ""

# API keys for remote endpoints
# Set an environment variable like OPENAI_KEY
# Confirm endpoints are exactly as in endpoints block
api_keys:
  "http://ollama0:11434": "ollama"
  "http://ollama1:11434": "ollama"
  "http://ollama2:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
  "http://192.168.0.33:8889/v1": "llama"
```

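The `${OPENAI_KEY}` value in `api_keys` suggests that keys are filled in from environment variables. The following is a minimal Python sketch of that kind of substitution, assuming the `${NAME}` placeholder syntax shown in the config. `expand_env` and `resolve_api_keys` are hypothetical helper names for illustration, not the router's actual code.

```python
import os
import re

# Matches placeholders of the form ${NAME} (syntax assumed from the config example).
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace each ${NAME} in value with env[NAME]; unknown names are left untouched."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)


def resolve_api_keys(api_keys, env=None):
    """Return a copy of the endpoint -> key mapping with placeholders expanded."""
    return {endpoint: expand_env(key, env) for endpoint, key in api_keys.items()}


# Illustrative values only; "sk-demo" is a made-up key.
resolved = resolve_api_keys(
    {
        "https://api.openai.com/v1": "${OPENAI_KEY}",
        "http://192.168.0.33:8889/v1": "llama",
    },
    env={"OPENAI_KEY": "sk-demo"},
)
print(resolved)
```

Leaving unknown placeholders untouched is a design choice of this sketch: it makes a missing variable easy to spot instead of silently sending an empty key.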
Create a dedicated virtual environment for NOMYO Router and install the requirements:

```
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

Optionally, export the API keys in your shell:

```
export OPENAI_KEY=YOUR_SECRET_API_KEY
# Optional: router-level key (clients must send Authorization: Bearer <key>)
# export NOMYO_ROUTER_API_KEY=YOUR_ROUTER_KEY
```

Finally, start the router with uvicorn:

```
uvicorn router:app --host 127.0.0.1 --port 12434
```

In <u>very</u> high-concurrency scenarios (> 500 simultaneous requests), you can also run with uvloop:

```
uvicorn router:app --host 127.0.0.1 --port 12434 --loop uvloop
```

## Docker Deployment

### Pre-built image (GitHub Container Registry)

Pre-built multi-arch images (`linux/amd64`, `linux/arm64`) are published automatically on every release.

**Lean image** (exact-match cache, ~300 MB):

```sh
docker pull bitfreedom.net/nomyo-ai/nomyo-router:latest
docker pull bitfreedom.net/nomyo-ai/nomyo-router:0.7
```

**Semantic image** (semantic cache with `all-MiniLM-L6-v2` pre-baked, ~800 MB):

```sh
docker pull bitfreedom.net/nomyo-ai/nomyo-router:latest-semantic
docker pull bitfreedom.net/nomyo-ai/nomyo-router:0.7-semantic
```

### Build the container image locally

```sh
# Lean build (exact match cache, default)
docker build -t nomyo-router .

# Semantic build: sentence-transformers + model baked in
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```

Run the router in Docker with your own configuration file mounted from the host. The entrypoint script accepts a `--config-path` argument so you can point to a file anywhere inside the container:

```sh
docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /absolute/path/to/config_folder:/app/config/ \
  -e CONFIG_PATH=/app/config/config.yaml \
  nomyo-router
```

Notes:

- `-e CONFIG_PATH` sets the `NOMYO_ROUTER_CONFIG_PATH` environment variable under the hood; you can export it directly instead if you prefer.
- To override the bind address or port, export `UVICORN_HOST` or `UVICORN_PORT`, or pass the corresponding uvicorn flags after `--`, e.g. `nomyo-router --config-path /config/config.yaml -- --port 9000`.
- Use `docker logs nomyo-router` to confirm the loaded endpoints and concurrency settings at startup.

# Routing

NOMYO Router accepts any Ollama request on the configured port, for any Ollama endpoint, from your frontend application. It then checks which backends are available for that specific request.
When the request is an embed(dings), chat, or generate call, it is forwarded to a single Ollama server. The server answers and sends the response back to the router, which forwards it to the frontend.

If another request for the same model config arrives, NOMYO Router knows which model runs on which Ollama server and routes the request to a server where that model is already deployed.

If the number of simultaneous connections exceeds the configured `max_concurrent_connections`, NOMYO Router routes the request to another Ollama server that serves the requested model and has the fewest connections, for the fastest completion.

This way the Ollama backend servers are utilized more efficiently than with a simple weighted, round-robin, or least-connection approach.

NOMYO Router also supports OpenAI API compatible v1 backend servers.

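The routing rules in this section can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the router's actual implementation: the `Endpoint` class, its fields, the `choose_endpoint` function, and the model name `llama3` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    url: str
    loaded: set = field(default_factory=set)     # models already deployed on this server
    available: set = field(default_factory=set)  # models this server is able to serve
    active: dict = field(default_factory=dict)   # model name -> open connections

    def connections(self, model):
        return self.active.get(model, 0)


def choose_endpoint(endpoints, model, max_concurrent_connections):
    """Pick a backend for `model` following the rules described in the text."""
    candidates = [e for e in endpoints if model in e.available or model in e.loaded]
    if not candidates:
        raise LookupError(f"no endpoint serves {model!r}")

    # 1. Prefer a server where the model is already deployed and a slot is free.
    warm = [
        e for e in candidates
        if model in e.loaded and e.connections(model) < max_concurrent_connections
    ]
    if warm:
        return min(warm, key=lambda e: e.connections(model))

    # 2. Otherwise use the server with the fewest connections for this model.
    return min(candidates, key=lambda e: e.connections(model))


a = Endpoint("http://ollama0:11434", loaded={"llama3"}, available={"llama3"}, active={"llama3": 1})
b = Endpoint("http://ollama1:11434", available={"llama3"}, active={"llama3": 0})
first_choice = choose_endpoint([a, b], "llama3", max_concurrent_connections=2)

a.active["llama3"] = 2  # the warm server is now saturated
second_choice = choose_endpoint([a, b], "llama3", max_concurrent_connections=2)
print(first_choice.url, second_choice.url)
```

In this sketch the first request goes to `ollama0` because the model is already loaded there, even though `ollama1` is idle; only once `ollama0` reaches the limit does the request spill over to the least-loaded alternative.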
## Semantic LLM Cache

NOMYO Router includes an optional semantic cache that serves repeated or semantically similar LLM requests from cache: no endpoint round-trip, no token cost, response in <10 ms.

### Enable (exact match, any image)

```yaml
# config.yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 1.0 # exact match only
cache_ttl: 3600
```

### Enable (semantic matching, :semantic image)

```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90 # "What is Python?" ≈ "What's Python?" → cache hit
cache_ttl: 3600
cache_history_weight: 0.3
```

Pull the semantic image:

```bash
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
```

### Cache key strategy

Each request is keyed on `model + system_prompt` (exact) combined with a weighted-mean embedding of BM25-weighted chat history (30%) and the last user message (70%). This means:

- Different system prompts → always separate cache namespaces (no cross-tenant leakage)
- Same question, different phrasing → cache hit (semantic mode)
- MOE requests (`moe-*`) → always bypass the cache

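The key strategy above can be sketched in plain Python. This is an illustration under several assumptions, not the router's code: the function names are hypothetical, the vectors are toy stand-ins for real sentence embeddings, BM25 weighting of the history is omitted, and cosine similarity is assumed as the comparison metric.

```python
import hashlib
import math


def namespace_key(model, system_prompt):
    """Exact-match part of the key: a hash over model and system prompt."""
    return hashlib.sha256(f"{model}\x00{system_prompt}".encode("utf-8")).hexdigest()


def blend(history_vec, last_msg_vec, history_weight=0.3):
    """Weighted mean: 30% chat history, 70% last user message by default."""
    return [
        history_weight * h + (1.0 - history_weight) * m
        for h, m in zip(history_vec, last_msg_vec)
    ]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_hit(cached, query, threshold=0.90):
    """cached and query are (namespace, vector) pairs; namespaces must match exactly."""
    return cached[0] == query[0] and cosine(cached[1], query[1]) >= threshold


ns_helpful = namespace_key("some-model", "You are helpful.")
ns_terse = namespace_key("some-model", "You are terse.")
blended = blend([1.0, 0.0], [0.0, 1.0])
print(blended)
```

Because the namespace comparison is exact and happens before any similarity check, two requests with different system prompts can never share an entry, no matter how similar their embeddings are.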
### Cached routes

`/api/chat` · `/api/generate` · `/v1/chat/completions` · `/v1/completions`

### Cache management

```bash
curl http://localhost:12434/api/cache/stats # hit rate, counters, config
curl -X POST http://localhost:12434/api/cache/invalidate # clear all entries
```

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:

```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```
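The same two options can be used from a Python client. The sketch below relies only on the standard library and assumes the router listens on `localhost:12434` as in the examples; `build_request` and `fetch_tags` are hypothetical helper names.

```python
import os
import urllib.parse
import urllib.request
from typing import Optional

ROUTER_URL = "http://localhost:12434"  # host and port taken from the examples above


def build_request(path: str, router_key: Optional[str], use_query_param: bool = False):
    """Build a request that carries the router key as a header or as a query parameter."""
    url = ROUTER_URL + path
    headers = {}
    if router_key:
        if use_query_param:
            # Fallback: append ?api_key=<router_key> to the URL.
            separator = "&" if "?" in url else "?"
            url += separator + urllib.parse.urlencode({"api_key": router_key})
        else:
            # Recommended: send the key in the Authorization header.
            headers["Authorization"] = f"Bearer {router_key}"
    return urllib.request.Request(url, headers=headers)


def fetch_tags() -> bytes:
    """Call /api/tags on a running router, reading the key from the environment."""
    request = build_request("/api/tags", os.environ.get("NOMYO_ROUTER_API_KEY"))
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

Calling `fetch_tags()` requires a running router. The header variant is usually preferable because query strings tend to end up in access logs.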