diff --git a/doc/configuration.md b/doc/configuration.md
index af2dfae..7d9986a 100644
--- a/doc/configuration.md
+++ b/doc/configuration.md
@@ -32,6 +32,16 @@ endpoints:
 # Maximum concurrent connections *per endpoint‑model pair* (equals to OLLAMA_NUM_PARALLEL)
 max_concurrent_connections: 2
 
+# Per-endpoint overrides — any field not listed falls back to the global default (optional)
+# endpoint_config:
+#   "http://192.168.0.50:11434":
+#     max_concurrent_connections: 4
+#   "http://192.168.0.51:11434":
+#     max_concurrent_connections: 1
+
+# Priority / WRR routing (optional, default: false)
+# priority_routing: true
+
 # Optional router-level API key to secure the router and dashboard (leave blank to disable)
 nomyo-router-api-key: ""
 
@@ -86,6 +96,76 @@ max_concurrent_connections: 4
 - When this limit is reached, the router will route requests to other endpoints with available capacity
 - Higher values allow more parallel requests but may increase memory usage
 
+### `endpoint_config`
+
+**Type**: `dict[str, dict]` (optional)
+
+**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)
+
+**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.
+
+**Supported per-endpoint fields**:
+
+| Field | Description |
+|---|---|
+| `max_concurrent_connections` | Overrides the global limit for this endpoint only |
+
+**Example**:
+```yaml
+endpoint_config:
+  "http://192.168.0.50:11434":
+    max_concurrent_connections: 4   # high-memory GPU node
+  "http://192.168.0.51:11434":
+    max_concurrent_connections: 1   # low-memory node
+```
+
+**Notes**:
+- Useful when endpoints have different hardware capacity.
+- The utilization ratio used by WRR (`priority_routing: true`) is computed per-endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, the same as a node with a limit of 2 running 1 request.
+
+---
+
+### `priority_routing`
+
+**Type**: `bool` (optional)
+
+**Default**: `false`
+
+**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.
+
+| Value | Algorithm |
+|---|---|
+| `false` (default) | Pick the least-loaded endpoint by raw connection count; ties between equally idle endpoints are broken randomly. |
+| `true` | **Weighted Round Robin (WRR)** — endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |
+
+**Example**:
+```yaml
+priority_routing: true
+```
+
+**When to use WRR**:
+- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
+- You combine it with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.
+
+**Example — primary/fallback setup**:
+```yaml
+endpoints:
+  - http://gpu-primary:11434      # preferred
+  - http://gpu-secondary:11434    # fallback
+
+endpoint_config:
+  "http://gpu-primary:11434":
+    max_concurrent_connections: 4
+  "http://gpu-secondary:11434":
+    max_concurrent_connections: 2
+
+priority_routing: true
+```
+
+With this configuration the primary handles up to 4 concurrent requests before the secondary receives any traffic.
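+
+A minimal sketch of the ranking this mode uses, assuming it follows the utilization-ratio rule described in the table above (illustrative Python, not the router's actual code; the `Endpoint` class and its `active`/`limit` fields are hypothetical names for the per-endpoint connection count and effective limit):
+
+```python
+# Illustrative sketch only; names here are hypothetical, not the router's internals.
+from dataclasses import dataclass
+
+@dataclass
+class Endpoint:
+    url: str
+    active: int   # in-flight requests on this endpoint
+    limit: int    # effective max_concurrent_connections (override or global)
+
+def pick(endpoints: list[Endpoint]) -> Endpoint | None:
+    """Rank by utilization ratio; config order breaks ties (min() keeps the first)."""
+    free = [e for e in endpoints if e.active < e.limit]
+    if not free:
+        return None  # every endpoint is saturated
+    return min(free, key=lambda e: e.active / e.limit)
+
+# Both nodes idle -> equal utilization (0%) -> the first-listed primary wins the tie.
+eps = [Endpoint("http://gpu-primary:11434", 0, 4),
+       Endpoint("http://gpu-secondary:11434", 0, 2)]
+print(pick(eps).url)   # http://gpu-primary:11434
+```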
+
+---
+
 ### `router_api_key`
 
 **Type**: `str` (optional)
@@ -204,6 +284,26 @@ max_concurrent_connections: 3
 
 **Recommendation**: Use multiple endpoints for redundancy and load distribution.
 
+### Priority Routing (Primary + Fallback)
+
+When you have heterogeneous hardware and want to prefer a faster node:
+
+```yaml
+endpoints:
+  - http://gpu-primary:11434      # high-VRAM node, listed first = highest priority
+  - http://gpu-secondary:11434
+
+endpoint_config:
+  "http://gpu-primary:11434":
+    max_concurrent_connections: 4
+  "http://gpu-secondary:11434":
+    max_concurrent_connections: 2
+
+priority_routing: true
+```
+
+The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true`, the default behavior is random selection among equally idle endpoints.
+
 ## Semantic LLM Cache
 
 NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.