Bug: Endpoint config error status not acknowledged #83
Reference: nomyo-ai/nomyo-router#83
Hi there, I observed that a few completion calls returned empty, looked at the nomyo dashboard and noticed that one endpoint showed an error status.
The log shows:
This is expected; however, when calling the health function, the same endpoint shows a different status.
Therefore, I assume that the observed error status is not acknowledged by the health endpoint and can also lead to dead-end completion calls. Unfortunately, this only occurred on our production deployment, so I wasn't able to narrow down the issue further.
/oc investigate root cause
I now have a complete understanding of the root cause. Here's the analysis:
Root Cause Analysis
The Bug Flow
1. Health check and routing query different endpoints
/health (line 4025) queries /api/version → succeeds → reports "status": "ok"
/api/config (line 3173) queries /api/version → succeeds → reports "status": "ok"
fetch.loaded_models() (line 1024) queries /api/ps → 502 Bad Gateway → logs the error
2. loaded_models() silently returns an empty set on failure
At router.py:1036-1039, when /api/ps fails, loaded_models() returns set(), an empty set meaning "no models loaded".
3. Error is cached for 300 seconds
At router.py:1094-1097, the _loaded_error_cache caches the failure for 300s. During this window, loaded_models() returns set() without making any HTTP request.
4. Routing algorithm is poisoned
In choose_endpoint() step 3 (router.py:1901-1904), loaded_sets contains set() for the failed endpoint, so model in models is always False for it. No endpoints pass the "loaded AND free" filter.
5. Fallback routes to unloaded endpoints
The algorithm falls through to step 4 (router.py:1924-1928), which only checks for free slots, not whether the model is loaded. The router sends the request to an endpoint that doesn't have the model loaded → empty response.
Summary
/health → /api/version
/api/config → /api/version
loaded_models() → /api/ps → set(), cached 300s
choose_endpoint() → loaded_models()
The mismatch: Health check and dashboard query /api/version (which works), while routing depends on /api/ps (which fails). The 300s error cache in loaded_models() means the router continues making incorrect routing decisions for 5 minutes after the /api/ps endpoint recovers.
Key Files
router.py:1063-1119: fetch.loaded_models() with error cache
router.py:1094-1097: _loaded_error_cache short-circuit (300s TTL)
router.py:1901-1904: Step 3: "loaded AND free" filter
router.py:1924-1928: Step 4 fallback: free slot only (no loaded check)
router.py:4025: Health check uses /api/version (not /api/ps)
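One way to picture closing the mismatch is a status check that probes the same path routing depends on. The sketch below is purely illustrative and is not taken from router.py or from any actual fix: `endpoint_status` and its two probe callables are hypothetical names, and the probes are injected so the example runs without a network.

```python
from typing import Callable


def endpoint_status(
    probe_version: Callable[[], object],
    probe_ps: Callable[[], object],
) -> dict:
    """Report "ok" only if both the /api/version and /api/ps probes succeed."""
    failed = []
    for name, probe in (("/api/version", probe_version), ("/api/ps", probe_ps)):
        try:
            probe()  # assumed to raise on any HTTP or connection error
        except Exception as exc:
            failed.append(f"{name}: {exc}")
    if failed:
        return {"status": "error", "failed": failed}
    return {"status": "ok"}
```

With a check of this shape, an endpoint whose /api/ps returns 502 would show an error status even while /api/version still answers, which matches what the dashboard reported in this issue.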
Thank you for the quick response. This explains the behavior. Can this be fixed in a new release?
WIP
The oc bot was almost right about the root cause.
A fix is being prepared and will be committed shortly.
@JTHesse do you want to check out PR #85?
Policy requires that the PR branch run successfully in our staging environment before it can be merged.
latest docker images have been released
v0.9.1 docker images have been released
Thank you, works now!