Compare commits


32 commits

Author SHA1 Message Date
0d5b8110f7 Merge pull request 'dev-v0.8.x' (#26) from dev-v0.8.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 3m42s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 1m29s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 11m53s
Build and Publish Docker Image / merge (push) Successful in 32s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m0s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 33s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/26
2026-05-01 16:48:07 +02:00
ceeba9a0cd doc: primary routing and max_connections per endpoint added 2026-05-01 13:55:29 +02:00
182ddae539 fix: prevent dashboard and route hangs when endpoints are down by calling skip_error_cache also with reduced timeout 2026-05-01 13:49:34 +02:00
bbe7bd48c5 feat: populate error cache also from endpoint_details if necessary 2026-04-29 17:03:32 +02:00
5797615736 feat: enhance load balancing #23 2026-04-22 17:27:34 +02:00
3a0ccbef7b feat: update on gitignore 2026-04-17 16:56:00 +02:00
5c64286b70 fix: correct user socket path 2026-04-17 13:31:19 +02:00
11637c9143 feat: support localhost llama_server access via unix sockets 2026-04-17 12:41:57 +02:00
d2c461d821 doc: update readme 2026-04-16 13:56:34 +02:00
a3928c9c33 Merge pull request 'dev-v0.7.x -> main' (#25) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 34s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 38s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m5s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 31s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m11s
Build and Publish Docker Image / merge (push) Successful in 31s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/25
2026-04-16 12:27:34 +02:00
f4b3a09151 Merge pull request 'dev-v0.7.x -> main' (#22) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 37s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 36s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m10s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 33s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m2s
Build and Publish Docker Image / merge (push) Successful in 33s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/22
2026-04-13 14:00:20 +02:00
07b80e654f Merge pull request 'dev-v0.7.x' (#21) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 35s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 36s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m21s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 32s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m3s
Build and Publish Docker Image / merge (push) Successful in 32s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/21
2026-04-08 14:01:54 +02:00
0bf91a6dd0 Merge pull request 'dev-v0.7.x' (#20) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 1m7s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 11m34s
Build and Publish Docker Image / merge (push) Successful in 36s
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 35s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 14m38s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 47s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/20
2026-04-07 17:56:32 +02:00
4086b1eab8 Merge pull request 'dev-v0.7.x -> prod' (#19) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 37s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 35s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m8s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 34s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m4s
Build and Publish Docker Image / merge (push) Successful in 34s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/19
2026-04-07 16:25:56 +02:00
73f821d737 Merge pull request 'dev-v0.7.x' (#18) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 37s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 37s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m2s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 31s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m20s
Build and Publish Docker Image / merge (push) Successful in 32s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/18
2026-04-06 10:19:28 +02:00
bf41902f31 Merge pull request 'dev-v0.7.x' (#17) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 2m0s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Failing after 58s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 13m15s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 47s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 11m33s
Build and Publish Docker Image / merge (push) Has been skipped
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/17
2026-04-05 19:02:01 +02:00
23f66b7c0e Merge pull request 'dev-v0.7.x' (#16) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Failing after 3m34s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 1m30s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 14m44s
Build and Publish Docker Image (Semantic Cache) / merge (push) Has been skipped
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 11m59s
Build and Publish Docker Image / merge (push) Failing after 29s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/16
2026-04-05 18:08:31 +02:00
d78c85dc48 Merge pull request 'fix: mergeblock missing container block' (#15) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build (linux/amd64, docker-amd64) (push) Successful in 1m59s
Build and Publish Docker Image / build (linux/amd64, docker-amd64) (push) Failing after 58s
Build and Publish Docker Image (Semantic Cache) / build (linux/arm64, docker-arm64) (push) Successful in 13m19s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 32s
Build and Publish Docker Image / build (linux/arm64, docker-arm64) (push) Successful in 12m11s
Build and Publish Docker Image / merge (push) Has been skipped
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/15
2026-04-05 09:54:41 +02:00
de7a97514a Merge pull request 'fix: workflow once more' (#14) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build (linux/amd64, docker-amd64) (push) Successful in 2m1s
Build and Publish Docker Image / build (linux/amd64, docker-amd64) (push) Successful in 1m5s
Build and Publish Docker Image (Semantic Cache) / build (linux/arm64, docker-arm64) (push) Successful in 13m28s
Build and Publish Docker Image (Semantic Cache) / merge (push) Failing after 15s
Build and Publish Docker Image / merge (push) Has been cancelled
Build and Publish Docker Image / build (linux/arm64, docker-arm64) (push) Has been cancelled
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/14
2026-04-05 09:33:32 +02:00
83f878dfc5 Merge pull request 'feat: add matrix build' (#13) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build (linux/amd64, docker-amd64) (push) Failing after 3m38s
Build and Publish Docker Image / build (linux/amd64, docker-amd64) (push) Failing after 1m35s
Build and Publish Docker Image (Semantic Cache) / build (linux/arm64, docker-arm64) (push) Failing after 14m42s
Build and Publish Docker Image (Semantic Cache) / merge (push) Has been skipped
Build and Publish Docker Image / build (linux/arm64, docker-arm64) (push) Failing after 12m1s
Build and Publish Docker Image / merge (push) Has been skipped
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/13
2026-04-04 22:47:31 +02:00
7a667c3a08 Merge pull request 'fix: added PAT REGISTRY_TOKEN' (#12) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Successful in 43m35s
Build and Publish Docker Image / build-and-push (push) Successful in 27m49s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/12
2026-04-02 19:27:38 +02:00
50c73c2f6e Merge pull request 'fix: secrets' (#11) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 43m26s
Build and Publish Docker Image / build-and-push (push) Has been cancelled
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/11
2026-04-02 18:43:24 +02:00
0e6fc72324 Merge pull request 'fix: buildkit container network access' (#10) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Has been cancelled
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/10
2026-04-02 17:38:33 +02:00
e66d5f5fb1 Merge pull request 'fix: add dns' (#9) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Has been cancelled
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/9
2026-04-02 17:18:02 +02:00
bd51490659 Merge pull request 'fix: revert registry url' (#8) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 20m2s
Build and Publish Docker Image / build-and-push (push) Has been cancelled
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/8
2026-04-02 16:57:01 +02:00
c15f7e0be6 Merge pull request 'fix: correct registry path' (#7) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 19m18s
Build and Publish Docker Image / build-and-push (push) Failing after 19m17s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/7
2026-04-02 14:51:38 +02:00
62b2eed04d Merge pull request 'fix: remove dind' (#6) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 19m16s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/6
2026-04-02 13:43:27 +02:00
c3b1b0f864 Merge pull request 'fix: wait for docker daemon for docker info to succeed' (#5) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 1m21s
Build and Publish Docker Image / build-and-push (push) Failing after 1m20s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/5
2026-04-02 13:40:02 +02:00
ac6bcbf296 Merge pull request 'fix: bypass automatic repo base_url' (#4) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 24s
Build and Publish Docker Image / build-and-push (push) Failing after 23s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/4
2026-04-02 13:36:31 +02:00
748bb1e932 Merge pull request 'fix: repo url' (#3) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 53s
Build and Publish Docker Image / build-and-push (push) Failing after 51s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/3
2026-04-02 13:31:08 +02:00
18f99b402b Merge pull request 'feat: add forgejo workflows' (#2) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Failing after 2m30s
Build and Publish Docker Image / build-and-push (push) Failing after 49s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/2
2026-04-02 13:13:58 +02:00
ba1b2fb651 Merge pull request 'dev-v0.7.x to prod' (#1) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Has been cancelled
Build and Publish Docker Image / build-and-push (push) Has been cancelled
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/1
2026-04-02 09:17:59 +02:00
5 changed files with 304 additions and 55 deletions

.gitignore vendored (3 changed lines)

@ -67,3 +67,6 @@ config.yaml
*.db*
*settings.json
# Test suite (local only, not committed yet)
test/


@ -1,6 +1,6 @@
# NOMYO Router
is a transparent proxy for [Ollama](https://github.com/ollama/ollama) with model deployment aware routing.
is a transparent proxy for inference engines, e.g. [Ollama](https://github.com/ollama/ollama), [llama.cpp](https://github.com/ggml-org/llama.cpp/), [vllm](https://github.com/vllm-project/vllm), or any OpenAI V1-compatible endpoint, with model-deployment-aware routing.
[![Click for video](https://github.com/user-attachments/assets/ddacdf88-e3f3-41dd-8be6-f165b22d9879)](https://eu1.nomyo.ai/assets/dash.mp4)
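
Because the router itself exposes the same OpenAI-compatible routes (`/v1/chat/completions`, `/v1/models`, ...), pointing a client at it is just a base-URL change. A minimal sketch, where the router address, port, and model name are assumptions for illustration only:

```python
# Talk to the router exactly like any OpenAI-compatible server (values assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="no-key")
resp = client.chat.completions.create(
    model="qwen2.5:7b",  # any model served by one of the configured endpoints
    messages=[{"role": "user", "content": "Hello from the router"}],
)
print(resp.choices[0].message.content)
```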


@ -9,8 +9,23 @@ llama_server_endpoints:
- http://192.168.0.50:8889/v1
# Maximum concurrent connections *per endpoint-model pair* (equivalent to OLLAMA_NUM_PARALLEL)
# This is the global default; individual endpoints can override it via endpoint_config below.
max_concurrent_connections: 2
# Per-endpoint overrides (optional). Any field not listed falls back to the global default.
# endpoint_config:
# "http://192.168.0.50:11434":
# max_concurrent_connections: 3
# "http://192.168.0.51:11434":
# max_concurrent_connections: 1
# Priority / WRR routing (optional, default: false).
# When true, requests are routed by utilization ratio (usage/max_concurrent_connections)
# and the config order of endpoints acts as the tiebreaker — the first endpoint listed
# is preferred when two endpoints are equally loaded.
# When false (default), equally-idle endpoints are chosen at random.
# priority_routing: true
# Optional router-level API key that gates router/API/web UI access (leave empty to disable)
nomyo-router-api-key: ""


@ -32,6 +32,16 @@ endpoints:
# Maximum concurrent connections *per endpoint-model pair* (equivalent to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2
# Per-endpoint overrides — any field not listed falls back to the global default (optional)
# endpoint_config:
# "http://192.168.0.50:11434":
# max_concurrent_connections: 4
# "http://192.168.0.51:11434":
# max_concurrent_connections: 1
# Priority / WRR routing (optional, default: false)
# priority_routing: true
# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""
@ -86,6 +96,76 @@ max_concurrent_connections: 4
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage
### `endpoint_config`
**Type**: `dict[str, dict]` (optional)
**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)
**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.
**Supported per-endpoint fields**:
| Field | Description |
|---|---|
| `max_concurrent_connections` | Overrides the global limit for this endpoint only |
**Example**:
```yaml
endpoint_config:
"http://192.168.0.50:11434":
max_concurrent_connections: 4 # high-memory GPU node
"http://192.168.0.51:11434":
max_concurrent_connections: 1 # low-memory node
```
**Notes**:
- Useful when endpoints have different hardware capacity.
- The utilization ratio used by WRR (`priority_routing: true`) is computed per-endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, same as a node with limit 2 running 1 request.
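
A rough sketch of how the effective limit feeds that ratio (plain Python; the helper names loosely mirror the `get_max_connections` / `utilization_ratio` helpers added in `router.py` further down, and the numbers are illustrative):

```python
# Illustrative only: effective per-endpoint limit and the resulting utilization ratio.
global_limit = 2
endpoint_config = {
    "http://192.168.0.50:11434": {"max_concurrent_connections": 4},
    "http://192.168.0.51:11434": {"max_concurrent_connections": 1},
}
active = {"http://192.168.0.50:11434": 2, "http://192.168.0.51:11434": 0}

def max_connections(ep: str) -> int:
    # Per-endpoint override, falling back to the global default.
    return endpoint_config.get(ep, {}).get("max_concurrent_connections", global_limit)

def utilization(ep: str) -> float:
    return active.get(ep, 0) / max_connections(ep)

print(utilization("http://192.168.0.50:11434"))  # 0.5: 2 of 4 slots in use
```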
---
### `priority_routing`
**Type**: `bool` (optional)
**Default**: `false`
**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.
| Value | Algorithm |
|---|---|
| `false` (default) | Random selection among equally-idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
| `true` | **Weighted Round Robin (WRR)** — endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |
**Example**:
```yaml
priority_routing: true
```
**When to use WRR**:
- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
- Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.
**Example — primary/fallback setup**:
```yaml
endpoints:
- http://gpu-primary:11434 # preferred
- http://gpu-secondary:11434 # fallback
endpoint_config:
"http://gpu-primary:11434":
max_concurrent_connections: 4
"http://gpu-secondary:11434":
max_concurrent_connections: 2
priority_routing: true
```
With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.
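
Under the hood the ordering is a stable two-pass sort, matching the `choose_endpoint` changes later in this diff. A simplified, self-contained sketch with illustrative values, not the router's exact code:

```python
# WRR ordering sketch: rank by config order, then stable-sort by utilization ratio,
# so the first-listed endpoint wins whenever two candidates are equally utilized.
endpoints = ["http://gpu-primary:11434", "http://gpu-secondary:11434"]
limits = {"http://gpu-primary:11434": 4, "http://gpu-secondary:11434": 2}
active = {"http://gpu-primary:11434": 2, "http://gpu-secondary:11434": 1}

priority = {ep: i for i, ep in enumerate(endpoints)}
candidates = sorted(endpoints, key=lambda ep: priority[ep])   # config order first
candidates.sort(key=lambda ep: active[ep] / limits[ep])       # stable sort: ratio decides
print(candidates[0])  # both at 0.5 utilization -> the primary keeps the request
```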
---
### `router_api_key`
**Type**: `str` (optional)
@ -204,6 +284,26 @@ max_concurrent_connections: 3
**Recommendation**: Use multiple endpoints for redundancy and load distribution.
### Priority Routing (Primary + Fallback)
When you have heterogeneous hardware and want to prefer a faster node:
```yaml
endpoints:
- http://gpu-primary:11434 # high-VRAM node, listed first = highest priority
- http://gpu-secondary:11434
endpoint_config:
"http://gpu-primary:11434":
max_concurrent_connections: 4
"http://gpu-secondary:11434":
max_concurrent_connections: 2
priority_routing: true
```
The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints.
## Semantic LLM Cache
NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.

router.py (203 changed lines)

@ -6,7 +6,7 @@ version: 0.7
license: AGPL
"""
# -------------------------------------------------------------
import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance, secrets, math, socket
import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance, secrets, math, socket, httpx
try:
import truststore; truststore.inject_into_ssl()
except ImportError:
@ -185,6 +185,8 @@ _CTX_TRIM_SMALL_LIMIT = 32768 # only proactively trim models with n_ctx at or b
app_state = {
"session": None,
"connector": None,
"socket_sessions": {}, # endpoint -> aiohttp.ClientSession(UnixConnector) for .sock endpoints
"httpx_clients": {}, # endpoint -> httpx.AsyncClient(UDS transport) for .sock endpoints
}
token_worker_task: asyncio.Task | None = None
flush_task: asyncio.Task | None = None
@ -216,6 +218,10 @@ class Config(BaseSettings):
llama_server_endpoints: List[str] = Field(default_factory=list)
# Max concurrent connections per endpoint-model pair, see OLLAMA_NUM_PARALLEL
max_concurrent_connections: int = 1
# Per-endpoint overrides: {endpoint_url: {max_concurrent_connections: N}}
endpoint_config: Dict[str, Dict] = Field(default_factory=dict)
# When True, config order = priority; routes by utilization ratio + config index (WRR)
priority_routing: bool = Field(default=False)
api_keys: Dict[str, str] = Field(default_factory=dict)
# Optional router-level API key used to gate access to this service and dashboard
@ -494,6 +500,65 @@ def _extract_llama_quant(name: str) -> str:
return name.rsplit(":", 1)[1]
return ""
def _is_unix_socket_endpoint(endpoint: str) -> bool:
"""Return True if endpoint uses Unix socket (.sock hostname convention).
Detects URLs like http://192.168.0.52.sock/v1 where the host ends with
.sock, indicating the connection should use a Unix domain socket at
/run/user/<uid>/<host> instead of TCP.
"""
try:
host = endpoint.split("//", 1)[1].split("/")[0].split(":")[0]
return host.endswith(".sock")
except IndexError:
return False
def _get_socket_path(endpoint: str) -> str:
"""Derive Unix socket file path from a .sock endpoint URL.
http://192.168.0.52.sock/v1 -> /run/user/<uid>/192.168.0.52.sock
"""
host = endpoint.split("//", 1)[1].split("/")[0].split(":")[0]
return f"/run/user/{os.getuid()}/{host}"
def get_session(endpoint: str) -> aiohttp.ClientSession:
"""Return the appropriate aiohttp session for the given endpoint.
Unix socket endpoints (.sock) get their own UnixConnector session.
All other endpoints share the main TCP session.
"""
if _is_unix_socket_endpoint(endpoint):
sess = app_state["socket_sessions"].get(endpoint)
if sess is not None:
return sess
return app_state["session"]
def _make_openai_client(
endpoint: str,
default_headers: dict | None = None,
api_key: str = "no-key",
) -> openai.AsyncOpenAI:
"""Return an AsyncOpenAI client configured for the given endpoint.
For Unix socket endpoints, injects a pre-created httpx UDS transport
so the OpenAI SDK connects via the socket instead of TCP.
"""
base_url = ep2base(endpoint)
kwargs: dict = {"api_key": api_key}
if default_headers is not None:
kwargs["default_headers"] = default_headers
if _is_unix_socket_endpoint(endpoint):
http_client = app_state["httpx_clients"].get(endpoint)
if http_client is not None:
kwargs["http_client"] = http_client
base_url = "http://localhost/v1"
return openai.AsyncOpenAI(base_url=base_url, **kwargs)
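# Usage sketch (illustrative): the handlers below simply swap a direct
#   openai.AsyncOpenAI(base_url=ep2base(endpoint), ...)
# for
#   oclient = _make_openai_client(endpoint, default_headers=..., api_key=...)
# so .sock endpoints transparently reuse the pre-built httpx UDS client.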
def _is_llama_model_loaded(item: dict) -> bool:
"""Return True if a llama-server /v1/models item has status 'loaded'.
Handles both dict format ({"value": "loaded"}) and plain string ("loaded").
@ -733,7 +798,7 @@ class fetch:
endpoint_url = f"{endpoint}/api/tags"
key = "models"
client: aiohttp.ClientSession = app_state["session"]
client: aiohttp.ClientSession = get_session(endpoint)
try:
async with client.get(endpoint_url, headers=headers) as resp:
await _ensure_success(resp)
@ -854,7 +919,7 @@ class fetch:
For Ollama endpoints: queries /api/ps and returns model names
For llama-server endpoints: queries /v1/models and filters for status.value == "loaded"
"""
client: aiohttp.ClientSession = app_state["session"]
client: aiohttp.ClientSession = get_session(endpoint)
# Check if this is a llama-server endpoint
if endpoint in config.llama_server_endpoints:
@ -981,7 +1046,7 @@ class fetch:
if _inflight_loaded_models.get(endpoint) == task:
_inflight_loaded_models.pop(endpoint, None)
async def endpoint_details(endpoint: str, route: str, detail: str, api_key: Optional[str] = None, skip_error_cache: bool = False) -> List[dict]:
async def endpoint_details(endpoint: str, route: str, detail: str, api_key: Optional[str] = None, skip_error_cache: bool = False, timeout: Optional[float] = None) -> List[dict]:
"""
Query <endpoint>/<route> to fetch <detail> and return a List of dicts with details
for the corresponding Ollama endpoint. If the request fails we respond with "N/A" for detail.
@ -989,6 +1054,8 @@ class fetch:
When ``skip_error_cache`` is False (the default), the call is short-circuited
if the endpoint recently failed (recorded in ``_available_error_cache``).
Pass ``skip_error_cache=True`` from health-check routes that must always probe.
``timeout`` overrides the session default for this single request (seconds, total).
"""
# Fast-fail if the endpoint is known to be down (unless caller opts out)
if not skip_error_cache:
@ -997,14 +1064,17 @@ class fetch:
if _is_fresh(_available_error_cache[endpoint], 300):
return []
client: aiohttp.ClientSession = app_state["session"]
client: aiohttp.ClientSession = get_session(endpoint)
headers = None
if api_key is not None:
headers = {"Authorization": "Bearer " + api_key}
request_url = f"{endpoint}{route}"
req_kwargs = {}
if timeout is not None:
req_kwargs["timeout"] = aiohttp.ClientTimeout(total=timeout)
try:
async with client.get(request_url, headers=headers) as resp:
async with client.get(request_url, headers=headers, **req_kwargs) as resp:
await _ensure_success(resp)
data = await resp.json()
detail = data.get(detail, [])
@ -1013,6 +1083,9 @@ class fetch:
# If anything goes wrong we cannot reply details
message = _format_connection_issue(request_url, e)
print(f"[fetch.endpoint_details] {message}")
if not skip_error_cache:
async with _available_error_cache_lock:
_available_error_cache[endpoint] = time.time()
return []
def ep2base(ep):
@ -1089,7 +1162,7 @@ async def _make_chat_request(model: str, messages: list, tools=None, stream: boo
"response_format": {"type": "json_schema", "json_schema": format} if format is not None else None
}
params.update({k: v for k, v in optional_params.items() if v is not None})
oclient = openai.AsyncOpenAI(base_url=ep2base(endpoint), default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
oclient = _make_openai_client(endpoint, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
else:
client = ollama.AsyncClient(host=endpoint)
@ -1636,6 +1709,12 @@ async def get_usage_counts() -> Dict:
# -------------------------------------------------------------
# 5. Endpoint selection logic (respecting the configurable limit)
# -------------------------------------------------------------
def get_max_connections(ep: str) -> int:
"""Per-endpoint max_concurrent_connections, falling back to the global value."""
return config.endpoint_config.get(ep, {}).get(
"max_concurrent_connections", config.max_concurrent_connections
)
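# Example (illustrative values): with
#   endpoint_config = {"http://gpu-primary:11434": {"max_concurrent_connections": 4}}
# get_max_connections("http://gpu-primary:11434") returns 4, while an endpoint
# without an override falls back to config.max_concurrent_connections.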
async def choose_endpoint(model: str, reserve: bool = True) -> tuple[str, str]:
"""
Determine which endpoint to use for the given model while respecting
@ -1706,13 +1785,26 @@ async def choose_endpoint(model: str, reserve: bool = True) -> tuple[str, str]:
def tracking_usage(ep: str) -> int:
return usage_counts.get(ep, {}).get(get_tracking_model(ep, model), 0)
def utilization_ratio(ep: str) -> float:
return tracking_usage(ep) / get_max_connections(ep)
# Priority map: position in all_endpoints list (lower = higher priority)
ep_priority = {ep: i for i, ep in enumerate(all_endpoints)}
# 3⃣ Endpoints that have the model loaded *and* a free slot
loaded_and_free = [
ep for ep, models in zip(candidate_endpoints, loaded_sets)
if model in models and tracking_usage(ep) < config.max_concurrent_connections
if model in models and tracking_usage(ep) < get_max_connections(ep)
]
if loaded_and_free:
if config.priority_routing:
# WRR: sort by config order first (stable), then by utilization ratio.
# Stable sort preserves priority for equal-ratio endpoints.
loaded_and_free.sort(key=lambda ep: ep_priority.get(ep, 999))
loaded_and_free.sort(key=utilization_ratio)
selected = loaded_and_free[0]
else:
# Sort ascending for load balancing — all endpoints here already have the
# model loaded, so there is no model-switching cost to optimise for.
loaded_and_free.sort(key=tracking_usage)
@ -1726,10 +1818,15 @@ async def choose_endpoint(model: str, reserve: bool = True) -> tuple[str, str]:
# 4⃣ Endpoints among the candidates that simply have a free slot
endpoints_with_free_slot = [
ep for ep in candidate_endpoints
if tracking_usage(ep) < config.max_concurrent_connections
if tracking_usage(ep) < get_max_connections(ep)
]
if endpoints_with_free_slot:
if config.priority_routing:
endpoints_with_free_slot.sort(key=lambda ep: ep_priority.get(ep, 999))
endpoints_with_free_slot.sort(key=utilization_ratio)
selected = endpoints_with_free_slot[0]
else:
# Sort by total endpoint load (ascending) to prefer idle endpoints.
endpoints_with_free_slot.sort(
key=lambda ep: sum(usage_counts.get(ep, {}).values())
@ -1740,6 +1837,12 @@ async def choose_endpoint(model: str, reserve: bool = True) -> tuple[str, str]:
selected = endpoints_with_free_slot[0]
else:
# 5⃣ All candidate endpoints are saturated -> pick the least-busy one (will queue)
if config.priority_routing:
selected = min(
candidate_endpoints,
key=lambda ep: (utilization_ratio(ep), ep_priority.get(ep, 999)),
)
else:
selected = min(candidate_endpoints, key=tracking_usage)
tracking_model = get_tracking_model(selected, model)
@ -1822,7 +1925,7 @@ async def proxy(request: Request):
"suffix": suffix,
}
params.update({k: v for k, v in optional_params.items() if v is not None})
oclient = openai.AsyncOpenAI(base_url=ep2base(endpoint), default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
oclient = _make_openai_client(endpoint, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
else:
client = ollama.AsyncClient(host=endpoint)
@ -2000,7 +2103,7 @@ async def chat_proxy(request: Request):
"response_format": {"type": "json_schema", "json_schema": _format} if _format is not None else None
}
params.update({k: v for k, v in optional_params.items() if v is not None})
oclient = openai.AsyncOpenAI(base_url=ep2base(endpoint), default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
oclient = _make_openai_client(endpoint, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
else:
client = ollama.AsyncClient(host=endpoint)
# For OpenAI endpoints: make the API call in handler scope
@ -2220,7 +2323,7 @@ async def embedding_proxy(request: Request):
if ":latest" in model:
model = model.split(":latest")
model = model[0]
client = openai.AsyncOpenAI(base_url=ep2base(endpoint), api_key=config.api_keys.get(endpoint, "no-key"))
client = _make_openai_client(endpoint, api_key=config.api_keys.get(endpoint, "no-key"))
else:
client = ollama.AsyncClient(host=endpoint)
# 3. Async generator that streams embedding data and decrements the counter
@ -2285,7 +2388,7 @@ async def embed_proxy(request: Request):
if ":latest" in model:
model = model.split(":latest")
model = model[0]
client = openai.AsyncOpenAI(base_url=ep2base(endpoint), api_key=config.api_keys.get(endpoint, "no-key"))
client = _make_openai_client(endpoint, api_key=config.api_keys.get(endpoint, "no-key"))
else:
client = ollama.AsyncClient(host=endpoint)
# 3. Async generator that streams embed data and decrements the counter
@ -2682,11 +2785,11 @@ async def tags_proxy(request: Request):
"""
# 1. Query all endpoints for models
tasks = [fetch.endpoint_details(ep, "/api/tags", "models") for ep in config.endpoints if "/v1" not in ep]
tasks += [fetch.endpoint_details(ep, "/models", "data", config.api_keys[ep]) for ep in config.endpoints if "/v1" in ep]
tasks = [fetch.endpoint_details(ep, "/api/tags", "models", skip_error_cache=True, timeout=8) for ep in config.endpoints if "/v1" not in ep]
tasks += [fetch.endpoint_details(ep, "/models", "data", config.api_keys[ep], skip_error_cache=True, timeout=8) for ep in config.endpoints if "/v1" in ep]
# Also query llama-server endpoints not already covered by config.endpoints
llama_eps_for_tags = [ep for ep in config.llama_server_endpoints if ep not in config.endpoints]
tasks += [fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep)) for ep in llama_eps_for_tags]
tasks += [fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep), skip_error_cache=True, timeout=8) for ep in llama_eps_for_tags]
all_models = await asyncio.gather(*tasks)
models = {'models': []}
@ -2720,12 +2823,12 @@ async def ps_proxy(request: Request):
For llama-server endpoints: queries /v1/models with status.value == "loaded"
"""
# 1. Query Ollama endpoints for running models via /api/ps
ollama_tasks = [fetch.endpoint_details(ep, "/api/ps", "models") for ep in config.endpoints if "/v1" not in ep]
ollama_tasks = [fetch.endpoint_details(ep, "/api/ps", "models", skip_error_cache=True, timeout=8) for ep in config.endpoints if "/v1" not in ep]
# 2. Query llama-server endpoints for loaded models via /v1/models
# Also query endpoints from llama_server_endpoints that may not be in config.endpoints
all_llama_endpoints = set(config.llama_server_endpoints) | set(ep for ep in config.endpoints if ep in config.llama_server_endpoints)
llama_tasks = [
fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep))
fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep), skip_error_cache=True, timeout=8)
for ep in all_llama_endpoints
]
@ -2775,12 +2878,12 @@ async def ps_details_proxy(request: Request):
For llama-server endpoints: queries /v1/models with status info
"""
# 1. Query Ollama endpoints via /api/ps
ollama_tasks = [(ep, fetch.endpoint_details(ep, "/api/ps", "models")) for ep in config.endpoints if "/v1" not in ep]
ollama_tasks = [(ep, fetch.endpoint_details(ep, "/api/ps", "models", skip_error_cache=True, timeout=8)) for ep in config.endpoints if "/v1" not in ep]
# 2. Query llama-server endpoints via /v1/models
# Also query endpoints from llama_server_endpoints that may not be in config.endpoints
all_llama_endpoints = set(config.llama_server_endpoints) | set(ep for ep in config.endpoints if ep in config.llama_server_endpoints)
llama_tasks = [
(ep, fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep)))
(ep, fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep), skip_error_cache=True, timeout=8))
for ep in all_llama_endpoints
]
@ -2834,7 +2937,7 @@ async def ps_details_proxy(request: Request):
# Fetch /props for each llama-server model to get context length (n_ctx)
# and unload sleeping models automatically
async def _fetch_llama_props(endpoint: str, model_id: str) -> tuple[int | None, bool, bool]:
client: aiohttp.ClientSession = app_state["session"]
client: aiohttp.ClientSession = get_session(endpoint)
base_url = endpoint.rstrip("/").removesuffix("/v1")
props_url = f"{base_url}/props?model={model_id}"
headers = None
@ -2842,7 +2945,7 @@ async def ps_details_proxy(request: Request):
if api_key:
headers = {"Authorization": f"Bearer {api_key}"}
try:
async with client.get(props_url, headers=headers) as resp:
async with client.get(props_url, headers=headers, timeout=aiohttp.ClientTimeout(total=5)) as resp:
if resp.status == 200:
data = await resp.json()
dgs = data.get("default_generation_settings", {})
@ -2907,7 +3010,7 @@ async def config_proxy(request: Request):
which endpoints are being proxied.
"""
async def check_endpoint(url: str):
client: aiohttp.ClientSession = app_state["session"]
client: aiohttp.ClientSession = get_session(url)
headers = None
if "/v1" in url:
headers = {"Authorization": "Bearer " + config.api_keys.get(url, "no-key")}
@ -2916,7 +3019,7 @@ async def config_proxy(request: Request):
target_url = f"{url}/api/version"
try:
async with client.get(target_url, headers=headers) as resp:
async with client.get(target_url, headers=headers, timeout=aiohttp.ClientTimeout(total=5)) as resp:
await _ensure_success(resp)
data = await resp.json()
if "/v1" in url:
@ -2990,9 +3093,7 @@ async def openai_embedding_proxy(request: Request):
api_key = config.api_keys.get(endpoint, "no-key")
else:
api_key = "ollama"
base_url = ep2base(endpoint)
oclient = openai.AsyncOpenAI(base_url=base_url, default_headers=default_headers, api_key=api_key)
oclient = _make_openai_client(endpoint, default_headers=default_headers, api_key=api_key)
try:
async_gen = await oclient.embeddings.create(input=doc, model=model)
@ -3105,8 +3206,7 @@ async def openai_chat_completions_proxy(request: Request):
# 2. Endpoint logic
endpoint, tracking_model = await choose_endpoint(model)
base_url = ep2base(endpoint)
oclient = openai.AsyncOpenAI(base_url=base_url, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
oclient = _make_openai_client(endpoint, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
# 3. Helpers and API call — done in handler scope so try/except works reliably
async def _normalize_images_in_messages(msgs: list) -> list:
"""Fetch remote image URLs and convert them to base64 data URLs so
@ -3416,8 +3516,7 @@ async def openai_completions_proxy(request: Request):
# 2. Endpoint logic
endpoint, tracking_model = await choose_endpoint(model)
base_url = ep2base(endpoint)
oclient = openai.AsyncOpenAI(base_url=base_url, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
oclient = _make_openai_client(endpoint, default_headers=default_headers, api_key=config.api_keys.get(endpoint, "no-key"))
# 3. Async generator that streams completions data and decrements the counter
# Make the API call in handler scope (try/except inside async generators is unreliable)
@ -3525,14 +3624,14 @@ async def openai_models_proxy(request: Request):
For llama-server endpoints: queries /v1/models and filters for status.value == "loaded"
"""
# 1. Query Ollama endpoints for all models via /api/tags
ollama_tasks = [fetch.endpoint_details(ep, "/api/tags", "models") for ep in config.endpoints if "/v1" not in ep]
ollama_tasks = [fetch.endpoint_details(ep, "/api/tags", "models", skip_error_cache=True, timeout=8) for ep in config.endpoints if "/v1" not in ep]
# 2. Query external OpenAI endpoints (Groq, OpenAI, etc.) via /models
ext_openai_tasks = [fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep)) for ep in config.endpoints if is_ext_openai_endpoint(ep)]
ext_openai_tasks = [fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep), skip_error_cache=True, timeout=8) for ep in config.endpoints if is_ext_openai_endpoint(ep)]
# 3. Query llama-server endpoints for loaded models via /v1/models
# Also query endpoints from llama_server_endpoints that may not be in config.endpoints
all_llama_endpoints = set(config.llama_server_endpoints) | set(ep for ep in config.endpoints if ep in config.llama_server_endpoints)
llama_tasks = [
fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep))
fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep), skip_error_cache=True, timeout=8)
for ep in all_llama_endpoints
]
@ -3672,7 +3771,7 @@ async def rerank_proxy(request: Request):
"Authorization": f"Bearer {api_key}",
}
client: aiohttp.ClientSession = app_state["session"]
client: aiohttp.ClientSession = get_session(endpoint)
try:
async with client.post(rerank_url, json=upstream_payload, headers=headers) as resp:
response_bytes = await resp.read()
@ -3845,7 +3944,9 @@ async def startup_event() -> None:
f"Loaded configuration from {config_path}:\n"
f" endpoints={config.endpoints},\n"
f" llama_server_endpoints={config.llama_server_endpoints},\n"
f" max_concurrent_connections={config.max_concurrent_connections}"
f" max_concurrent_connections={config.max_concurrent_connections},\n"
f" endpoint_config={config.endpoint_config},\n"
f" priority_routing={config.priority_routing}"
)
else:
print(
@ -3874,6 +3975,19 @@ async def startup_event() -> None:
app_state["connector"] = connector
app_state["session"] = session
# Create per-endpoint Unix socket sessions for .sock endpoints
for ep in config.llama_server_endpoints:
if _is_unix_socket_endpoint(ep):
sock_path = _get_socket_path(ep)
sock_connector = aiohttp.UnixConnector(path=sock_path)
sock_timeout = aiohttp.ClientTimeout(total=300, connect=5, sock_read=300)
sock_session = aiohttp.ClientSession(connector=sock_connector, timeout=sock_timeout)
app_state["socket_sessions"][ep] = sock_session
transport = httpx.AsyncHTTPTransport(uds=sock_path)
app_state["httpx_clients"][ep] = httpx.AsyncClient(transport=transport, timeout=300.0)
print(f"[startup] Unix socket session: {ep} -> {sock_path}")
token_worker_task = asyncio.create_task(token_worker())
flush_task = asyncio.create_task(flush_buffer())
await init_llm_cache(config)
@ -3883,6 +3997,23 @@ async def shutdown_event() -> None:
await close_all_sse_queues()
await flush_remaining_buffers()
await app_state["session"].close()
# Close Unix socket sessions
for ep, sess in list(app_state.get("socket_sessions", {}).items()):
try:
await sess.close()
print(f"[shutdown] Closed Unix socket session: {ep}")
except Exception as e:
print(f"[shutdown] Error closing Unix socket session {ep}: {e}")
# Close httpx Unix socket clients
for ep, client in list(app_state.get("httpx_clients", {}).items()):
try:
await client.aclose()
print(f"[shutdown] Closed httpx client: {ep}")
except Exception as e:
print(f"[shutdown] Error closing httpx client {ep}: {e}")
if token_worker_task is not None:
token_worker_task.cancel()
if flush_task is not None: