doc: primary routing and max_connections per endpoint added
parent 182ddae539 · commit ceeba9a0cd · 1 changed file with 100 additions and 0 deletions
@@ -32,6 +32,16 @@ endpoints:
# Maximum concurrent connections *per endpoint-model pair* (equivalent to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2

# Per-endpoint overrides — any field not listed falls back to the global default (optional)
# endpoint_config:
#   "http://192.168.0.50:11434":
#     max_concurrent_connections: 4
#   "http://192.168.0.51:11434":
#     max_concurrent_connections: 1

# Priority / WRR routing (optional, default: false)
# priority_routing: true

# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""
@@ -86,6 +96,76 @@ max_concurrent_connections: 4
- When this limit is reached, the router routes requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage

### `endpoint_config`

**Type**: `dict[str, dict]` (optional)

**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)

**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.

**Supported per-endpoint fields**:

| Field | Description |
|---|---|
| `max_concurrent_connections` | Overrides the global limit for this endpoint only |
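The fallback behaviour can be sketched in a few lines of Python. This is illustrative only; `effective_limit` is a hypothetical helper, not part of the router's actual API:

```python
# Sketch of effective-limit resolution, assuming a parsed config dict.
config = {
    "max_concurrent_connections": 2,  # global default
    "endpoint_config": {
        "http://192.168.0.50:11434": {"max_concurrent_connections": 4},
        "http://192.168.0.51:11434": {},  # no override -> global default
    },
}

def effective_limit(config: dict, endpoint: str) -> int:
    """Per-endpoint override if present, otherwise the global default."""
    override = config.get("endpoint_config", {}).get(endpoint, {})
    return override.get("max_concurrent_connections",
                        config["max_concurrent_connections"])

print(effective_limit(config, "http://192.168.0.50:11434"))  # 4 (override)
print(effective_limit(config, "http://192.168.0.51:11434"))  # 2 (global)
```

Endpoints absent from `endpoint_config` resolve the same way as endpoints listed with an empty override dict.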

**Example**:
```yaml
endpoint_config:
  "http://192.168.0.50:11434":
    max_concurrent_connections: 4   # high-memory GPU node
  "http://192.168.0.51:11434":
    max_concurrent_connections: 1   # low-memory node
```

**Notes**:
- Useful when endpoints have different hardware capacity.
- The utilization ratio used by WRR (`priority_routing: true`) is computed per endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, the same as a node with limit 2 running 1 request.
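The equal-utilization claim in the note above can be checked directly; a minimal sketch, where the `utilization` helper is illustrative rather than router code:

```python
# Utilization ratio = active_connections / effective max_concurrent_connections.
def utilization(active: int, limit: int) -> float:
    return active / limit

# A 4-slot node running 2 requests and a 2-slot node running 1 request
# are equally utilized (both 50%), so WRR ranks them the same.
assert utilization(2, 4) == utilization(1, 2) == 0.5
```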

---

### `priority_routing`

**Type**: `bool` (optional)

**Default**: `false`

**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.

| Value | Algorithm |
|---|---|
| `false` (default) | Random selection among equally idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
| `true` | **Weighted Round Robin (WRR)**: endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |

**Example**:
```yaml
priority_routing: true
```

**When to use WRR**:
- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
- Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.

**Example — primary/fallback setup**:
```yaml
endpoints:
  - http://gpu-primary:11434     # preferred
  - http://gpu-secondary:11434   # fallback

endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2

priority_routing: true
```

With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.

---

### `router_api_key`

**Type**: `str` (optional)
@@ -204,6 +284,26 @@ max_concurrent_connections: 3

**Recommendation**: Use multiple endpoints for redundancy and load distribution.

### Priority Routing (Primary + Fallback)

When you have heterogeneous hardware and want to prefer a faster node:

```yaml
endpoints:
  - http://gpu-primary:11434     # high-VRAM node, listed first = highest priority
  - http://gpu-secondary:11434

endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2

priority_routing: true
```

The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints.

## Semantic LLM Cache

NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.