feat: enhance load balancing #23
Reference: nomyo-ai/nomyo-router#23
Hi, I would like to propose two features for nomyo-router.
Currently, max_concurrent_connections can only be configured globally for all endpoints.
Since endpoint configurations can differ widely, a per-endpoint value would, in my opinion, improve load balancing.
In addition, a priority list could be added. For example, I am using these endpoints:
endpoints:
Assume none of those endpoints is processing any requests. When a new request arrives, I would like to be sure that it is handled by the ollama0 endpoint, and that ollama1 is only used once max_concurrent_connections is reached on ollama0.
This would allow a "fast" endpoint to be used primarily. Maybe this could be switched on/off, as some users might prefer a random approach.
Thanks!
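For illustration, such a config could look roughly like this. This is only a sketch: the per-endpoint `max_concurrent_connections` and `priority` keys are hypothetical and not part of the current config format.

```yaml
endpoints:
  - url: http://ollama0:11434
    max_concurrent_connections: 3   # hypothetical per-endpoint override
    priority: 1                     # hypothetical: lower value = preferred
  - url: http://ollama1:11434
    max_concurrent_connections: 8
    priority: 2
```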
the current load-balancing approach with model-aware routing assumes equally powerful endpoints.
as this is rather an ideal-world scenario, I like the idea of having a mechanism to support this.
however, I can imagine cases where the behaviour you describe prevents an efficient load-balancing strategy.
if ollama0 is running model X with max_concurrent_connections = 3, requests might never "overflow" to ollama1, leaving that endpoint's resources (with a different config or capacity) unused, which would be a waste of resources.
furthermore, it might be more convenient for users to utilize 2 or maybe even 3 endpoints, because they would all be capable of serving model X, but it would be a totally different game when serving a bigger model Y.
if your aim is accessing a "fast" endpoint because the model is already loaded and you want to avoid the cold-start delay, then this is exactly what the model-aware-routing functionality has been designed for.
if "fast" means a more capable endpoint that can serve a given model faster, then we need to account for the relative difference in token-generation speed between the individual endpoints.
one approach would be to configure it.
another would be to measure it, which can be tricky, as the endpoint being measured may be stressed by concurrent requests that skew the measurement. since we measure token counts already, a solution might be to average measurements over time to get a better idea of individual endpoint capabilities...
... adding max_concurrent_connections at a per-endpoint level will make this calculation even harder.
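for illustration, the measure-over-time idea could be sketched like this (a rough sketch with hypothetical names, not existing router code): an exponential moving average per endpoint dampens samples taken while the endpoint happened to be stressed.

```python
class ThroughputTracker:
    """Keeps a smoothed tokens/sec estimate per endpoint (illustrative sketch)."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha                    # smoothing factor: higher = react faster
        self.rates: dict[str, float] = {}     # endpoint name -> smoothed tokens/sec

    def record(self, endpoint: str, tokens: int, seconds: float) -> None:
        rate = tokens / seconds
        prev = self.rates.get(endpoint)
        # exponential moving average: one sample taken while the endpoint was
        # stressed by concurrent requests cannot dominate the estimate
        self.rates[endpoint] = rate if prev is None else prev + self.alpha * (rate - prev)

    def relative_speed(self, endpoint: str) -> float:
        """Endpoint rate relative to the fastest known endpoint (0..1)."""
        if not self.rates:
            return 1.0
        return self.rates.get(endpoint, 0.0) / max(self.rates.values())
```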
an easy fix would be to assign probabilities:
which is not ideal in some cases.
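that probability approach could be sketched as below (weights and names are hypothetical, chosen only for illustration):

```python
import random

# hypothetical weights reflecting relative endpoint capability
WEIGHTS = {"ollama0": 0.6, "ollama1": 0.3, "ollama2": 0.1}

def pick_endpoint(rng: random.Random) -> str:
    """Pick one endpoint; faster endpoints are chosen proportionally more often."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]
```

the caveat from above in concrete terms: the long-run ratios match the weights, but any single request can still land on a slow endpoint.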
looking forward to your comments so we can carve out a feature that will serve most users well.
Firstly, thank you for looking into this.
I can imagine that ideal load balancing across different endpoints is pretty tricky.
Maybe prioritization should only occur if all endpoints are equally loaded. So in the case where the router cannot decide between endpoints and needs to choose one at random, a prioritization function could be applied instead.
For that function, I think two scenarios are possible:
I think this would be a good compromise between the current mechanism and light prioritization.
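That tie-break idea could look roughly like this (a sketch only; the `active` counts and `priority` values are hypothetical):

```python
def choose(endpoints: list[dict]) -> dict:
    """Among the least-loaded endpoints, prefer the highest-priority one
    (lower priority number = preferred) instead of picking at random."""
    least = min(e["active"] for e in endpoints)
    candidates = [e for e in endpoints if e["active"] == least]
    return min(candidates, key=lambda e: e["priority"])
```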
By "fast" endpoint I was referring simply to a server that I know is faster due to better hardware.
To clarify, I would propose this balancing, assuming the model is loaded on all 3 endpoints:
1 request:
2 requests:
3 requests:
4 requests:
5 requests:
6 requests:
8 requests:
correct me if I am wrong, but what you describe is a WRR (weighted round-robin) mechanism.
the existing implementation would do:
2 requests:
thus avoiding the cold-start period on ollama2, which very likely results in a better time to first token, with the caveat of slower token generation for this request than if it had been loaded on ollama2.
I don't know how the current version of ollama handles this, but other inference engines (e.g. llama.cpp, vllm) would also benefit from the existing implementation for their kv-caches.
implementing WRR is a rather simple task that I could imagine adding as a configurable feature.
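for reference, a minimal sketch of one common WRR variant ("smooth" WRR, as popularized by nginx); weights and names here are hypothetical:

```python
class SmoothWRR:
    """Smooth weighted round-robin: spreads picks of the heaviest endpoint
    evenly across the cycle instead of bursting them back to back."""

    def __init__(self, weights: dict[str, int]):
        self.weights = weights
        self.current = {name: 0 for name in weights}

    def next(self) -> str:
        total = sum(self.weights.values())
        # every endpoint earns credit proportional to its weight
        for name, w in self.weights.items():
            self.current[name] += w
        # the endpoint with the most accumulated credit is picked and "pays" total
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen
```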
can you confirm that WRR is what you are looking for?
Yes, a WRR describes this quite well. However, only if the model is already loaded.
I guess what I am describing could look like this:
instead of:
The random.choice might choose endpoints that are slower.
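If I understand the proposal correctly, the contrast could be sketched as below (a sketch only; the endpoint state and helper names are hypothetical, and `random.choice` stands in for the current unweighted pick):

```python
import random

# hypothetical state: (name, active_connections, max_concurrent_connections),
# ordered by configured priority (fastest first)
ENDPOINTS = [
    ("ollama0", 3, 3),   # currently at capacity
    ("ollama1", 1, 3),
    ("ollama2", 0, 3),
]

def pick_prioritized() -> str:
    """Proposed behaviour: first endpoint in priority order with free capacity."""
    for name, active, limit in ENDPOINTS:
        if active < limit:
            return name
    raise RuntimeError("all endpoints at capacity")

def pick_random() -> str:
    """Unweighted pick: may land on a slower endpoint just as often."""
    return random.choice([name for name, _, _ in ENDPOINTS])
```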