feat: enhance load balancing #23
Reference: nomyo-ai/nomyo-router#23
Hi, I would like to propose two features for nomyo-router.
Currently, max_concurrent_connections can only be configured globally for all endpoints.
Since endpoint configurations can differ widely, a per-endpoint value would, in my opinion, improve load balancing.
In addition, a priority list could be added. For example, I am using these endpoints:
endpoints:
Assume none of those endpoints is processing any requests. When a new request arrives, I would like to be sure that it is handled by the ollama0 endpoint, and that ollama1 is only used once max_concurrent_connections is reached on ollama0.
This would allow a "fast" endpoint to be used primarily. Maybe this could be switched on/off, as some users might prefer a random approach.
Thanks!
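For illustration, such a config could look roughly like this. This is only a sketch: the per-endpoint `max_concurrent_connections` and `priority` keys are hypothetical and not part of the current config format.

```yaml
endpoints:
  - url: http://ollama0:11434
    max_concurrent_connections: 3   # hypothetical per-endpoint override
    priority: 1                     # hypothetical: lower value = preferred
  - url: http://ollama1:11434
    max_concurrent_connections: 8
    priority: 2
```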
the current load-balancing approach with model-aware routing assumes equally powerful endpoints.
as this is rather an ideal-world scenario, I like the idea of having a mechanism to support this.
however, I can imagine cases where the behaviour you describe prevents an efficient load-balancing strategy.
if ollama0 is running model X with max_concurrent_connections = 3, requests might never "overflow" to ollama1, leaving that endpoint's resources (with a different config or capacity) unused, which would be a waste of resources.
furthermore, it might be more convenient for users to utilize 2 or maybe even 3 endpoints, because they would all be capable of serving model X, but it would be a totally different game when serving a bigger model Y.
if your aim is accessing a "fast" endpoint because the model is already loaded and you want to avoid the cold-start delay, then this is exactly what the model-aware-routing functionality has been designed for.
if "fast" means a more capable endpoint that can serve a given model faster, then we need to account for the relative difference in token-generation speed between the individual endpoints.
one approach would be to configure it.
another would be to measure it, which can be tricky, as the endpoint being measured may be stressed by concurrent requests that skew the measurement. since we measure token counts already, a solution might be to average measurements over time to get a better idea of individual endpoint capabilities...
... adding max_concurrent_connections at a per-endpoint level will make this calculation even harder.
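for illustration, the measure-over-time idea could be sketched like this (a rough sketch with hypothetical names, not existing router code): an exponential moving average per endpoint dampens samples taken while the endpoint happened to be stressed.

```python
class ThroughputTracker:
    """Keeps a smoothed tokens/sec estimate per endpoint (illustrative sketch)."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha                    # smoothing factor: higher = react faster
        self.rates: dict[str, float] = {}     # endpoint name -> smoothed tokens/sec

    def record(self, endpoint: str, tokens: int, seconds: float) -> None:
        rate = tokens / seconds
        prev = self.rates.get(endpoint)
        # exponential moving average: one sample taken while the endpoint was
        # stressed by concurrent requests cannot dominate the estimate
        self.rates[endpoint] = rate if prev is None else prev + self.alpha * (rate - prev)

    def relative_speed(self, endpoint: str) -> float:
        """Endpoint rate relative to the fastest known endpoint (0..1)."""
        if not self.rates:
            return 1.0
        return self.rates.get(endpoint, 0.0) / max(self.rates.values())
```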
an easy fix would be to assign probabilities:
which is not ideal in some cases.
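that probability approach could be sketched as below (weights and names are hypothetical, chosen only for illustration):

```python
import random

# hypothetical weights reflecting relative endpoint capability
WEIGHTS = {"ollama0": 0.6, "ollama1": 0.3, "ollama2": 0.1}

def pick_endpoint(rng: random.Random) -> str:
    """Pick one endpoint; faster endpoints are chosen proportionally more often."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]
```

the caveat from above in concrete terms: the long-run ratios match the weights, but any single request can still land on a slow endpoint.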
looking forward to your comments so we can carve out a feature that will serve most users well.
Firstly, thank you for looking into this.
I can imagine that ideal load balancing across different endpoints is pretty tricky.
Maybe prioritization should only occur if all endpoints are equally loaded. So in the case where the router cannot decide between endpoints and needs to choose one at random, a prioritization function could be applied instead.
For that function, I think two scenarios are possible:
I think this would be a good compromise between the current mechanism and light prioritization.
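That tie-break idea could look roughly like this (a sketch only; the `active` counts and `priority` values are hypothetical):

```python
def choose(endpoints: list[dict]) -> dict:
    """Among the least-loaded endpoints, prefer the highest-priority one
    (lower priority number = preferred) instead of picking at random."""
    least = min(e["active"] for e in endpoints)
    candidates = [e for e in endpoints if e["active"] == least]
    return min(candidates, key=lambda e: e["priority"])
```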
By "fast" endpoint I was referring simply to a server that I know is faster due to better hardware.
To clarify, I would propose this balancing, assuming the model is loaded on all 3 endpoints:
1 request:
2 requests:
3 requests:
4 requests:
5 requests:
6 requests:
8 requests:
correct me if I am wrong, but what you describe is a WRR (weighted round-robin) mechanism.
the existing implementation would do:
2 requests:
thus avoiding the cold-start period on ollama2, which very likely results in a better time to first token, with the caveat of slower token generation for this request than if it had been loaded on ollama2.
I don't know how the current version of ollama handles this, but other inference engines (e.g. llama.cpp, vllm) would also benefit from the existing implementation for their kv-caches.
implementing WRR is a rather simple task that I could imagine adding as a configurable feature.
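for reference, a minimal sketch of one common WRR variant ("smooth" WRR, as popularized by nginx); weights and names here are hypothetical:

```python
class SmoothWRR:
    """Smooth weighted round-robin: spreads picks of the heaviest endpoint
    evenly across the cycle instead of bursting them back to back."""

    def __init__(self, weights: dict[str, int]):
        self.weights = weights
        self.current = {name: 0 for name in weights}

    def next(self) -> str:
        total = sum(self.weights.values())
        # every endpoint earns credit proportional to its weight
        for name, w in self.weights.items():
            self.current[name] += w
        # the endpoint with the most accumulated credit is picked and "pays" total
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen
```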
can you confirm that WRR is what you are looking for?
Yes, a WRR describes this quite well. However, only if the model is already loaded.
I guess what I am describing could look like this:
instead of:
The random.choice might choose endpoints that are slower.
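If I understand the proposal correctly, the contrast could be sketched as below (a sketch only; the endpoint state and helper names are hypothetical, and `random.choice` stands in for the current unweighted pick):

```python
import random

# hypothetical state: (name, active_connections, max_concurrent_connections),
# ordered by configured priority (fastest first)
ENDPOINTS = [
    ("ollama0", 3, 3),   # currently at capacity
    ("ollama1", 1, 3),
    ("ollama2", 0, 3),
]

def pick_prioritized() -> str:
    """Proposed behaviour: first endpoint in priority order with free capacity."""
    for name, active, limit in ENDPOINTS:
        if active < limit:
            return name
    raise RuntimeError("all endpoints at capacity")

def pick_random() -> str:
    """Unweighted pick: may land on a slower endpoint just as often."""
    return random.choice([name for name, _, _ in ENDPOINTS])
```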