doc: primary routing and max_connections per endpoint added
parent 182ddae539 · commit ceeba9a0cd · 1 changed file with 100 additions and 0 deletions
@@ -32,6 +32,16 @@ endpoints:
# Maximum concurrent connections *per endpoint-model pair* (equivalent to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2

# Per-endpoint overrides — any field not listed falls back to the global default (optional)
# endpoint_config:
#   "http://192.168.0.50:11434":
#     max_concurrent_connections: 4
#   "http://192.168.0.51:11434":
#     max_concurrent_connections: 1

# Priority / WRR routing (optional, default: false)
# priority_routing: true

# Optional router-level API key to secure the router and dashboard (leave blank to disable)
nomyo-router-api-key: ""
@@ -86,6 +96,76 @@ max_concurrent_connections: 4
- When this limit is reached, the router routes requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage

### `endpoint_config`

**Type**: `dict[str, dict]` (optional)

**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)

**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.

**Supported per-endpoint fields**:

| Field | Description |
|---|---|
| `max_concurrent_connections` | Overrides the global limit for this endpoint only |
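The fallback behaviour can be sketched in a few lines of Python. This is illustrative only; `effective_limit` is a hypothetical helper, not part of the router's actual API:

```python
# Sketch of effective-limit resolution, assuming a parsed config dict.
config = {
    "max_concurrent_connections": 2,  # global default
    "endpoint_config": {
        "http://192.168.0.50:11434": {"max_concurrent_connections": 4},
        "http://192.168.0.51:11434": {},  # no override -> global default
    },
}

def effective_limit(config: dict, endpoint: str) -> int:
    """Per-endpoint override if present, otherwise the global default."""
    override = config.get("endpoint_config", {}).get(endpoint, {})
    return override.get("max_concurrent_connections",
                        config["max_concurrent_connections"])

print(effective_limit(config, "http://192.168.0.50:11434"))  # 4 (override)
print(effective_limit(config, "http://192.168.0.51:11434"))  # 2 (global)
```

Endpoints absent from `endpoint_config` resolve the same way as endpoints listed with an empty override dict.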

**Example**:
```yaml
endpoint_config:
  "http://192.168.0.50:11434":
    max_concurrent_connections: 4   # high-memory GPU node
  "http://192.168.0.51:11434":
    max_concurrent_connections: 1   # low-memory node
```

**Notes**:
- Useful when endpoints have different hardware capacity.
- The utilization ratio used by WRR (`priority_routing: true`) is computed per endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, the same as a node with limit 2 running 1 request.
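The equal-utilization claim in the note above can be checked directly; a minimal sketch, where the `utilization` helper is illustrative rather than router code:

```python
# Utilization ratio = active_connections / effective max_concurrent_connections.
def utilization(active: int, limit: int) -> float:
    return active / limit

# A 4-slot node running 2 requests and a 2-slot node running 1 request
# are equally utilized (both 50%), so WRR ranks them the same.
assert utilization(2, 4) == utilization(1, 2) == 0.5
```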

---

### `priority_routing`

**Type**: `bool` (optional)

**Default**: `false`

**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.

| Value | Algorithm |
|---|---|
| `false` (default) | Random selection among equally idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
| `true` | **Weighted Round Robin (WRR)**: endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |

**Example**:
```yaml
priority_routing: true
```

**When to use WRR**:
- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
- Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.

**Example — primary/fallback setup**:
```yaml
endpoints:
  - http://gpu-primary:11434     # preferred
  - http://gpu-secondary:11434   # fallback

endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2

priority_routing: true
```

With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.

---

### `router_api_key`

**Type**: `str` (optional)
@@ -204,6 +284,26 @@ max_concurrent_connections: 3

**Recommendation**: Use multiple endpoints for redundancy and load distribution.

### Priority Routing (Primary + Fallback)

When you have heterogeneous hardware and want to prefer a faster node:

```yaml
endpoints:
  - http://gpu-primary:11434     # high-VRAM node, listed first = highest priority
  - http://gpu-secondary:11434

endpoint_config:
  "http://gpu-primary:11434":
    max_concurrent_connections: 4
  "http://gpu-secondary:11434":
    max_concurrent_connections: 2

priority_routing: true
```

The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints.

## Semantic LLM Cache

NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.