Bug: Endpoint config error status not acknowledged #83

Closed
opened 2026-05-18 10:56:42 +02:00 by JTHesse · 9 comments

Hi there, I observed that a few completion calls returned empty. I looked at the nomyo dashboard and noticed that one endpoint showed an error status.
The log shows:

INFO:     192.168.4.9:54078 - "GET /api/tags HTTP/1.1" 200 OK
[fetch.endpoint_details] Error while contacting http://ollama:8383/api/ps: 502: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.27.5</center>
</body>
</html>

This is expected; however, calling the health function for the same endpoint shows a different status.

{
  "status": "ok",
  "endpoints": {
    "http://ollama:8383": {
      "status": "ok",
      "version": []
    }
  }
}

Therefore, I assume that the observed error status is not acknowledged by the health endpoint and can also lead to dead-end completion calls. Unfortunately, this only occurred on our production deployment, so I wasn't able to investigate the issue further.

Owner

/oc investigate root cause

alpha-nerd self-assigned this 2026-05-18 12:47:05 +02:00
Owner

I now have a complete understanding of the root cause. Here's the analysis:

Root Cause Analysis

The Bug Flow

1. Health check and routing query different endpoints

  • /health (line 4025) queries /api/version → succeeds → reports "status": "ok"
  • Dashboard /api/config (line 3173) queries /api/version → succeeds → reports "status": "ok"
  • fetch.loaded_models() (line 1024) queries /api/ps → 502 Bad Gateway → logs the error

2. loaded_models() silently returns empty set on failure

At router.py:1036-1039:

except Exception as e:
    # If anything goes wrong we simply assume the endpoint has no models
    message = _format_connection_issue(f"{endpoint}/api/ps", e)
    print(f"[fetch.loaded_models] {message}")
    return set()

When /api/ps fails, loaded_models() returns set() — an empty set meaning "no models loaded".
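To illustrate why returning set() on failure is ambiguous, here is a minimal sketch that keeps "probe failed" distinct from "nothing loaded". It is not the code in router.py: probe_loaded_models and the fetch_ps callable are hypothetical names, and the payload shape ({"models": [{"name": ...}]}) is an assumption based on Ollama's /api/ps response.

```python
from typing import Callable, Optional, Set


def probe_loaded_models(fetch_ps: Callable[[], dict]) -> Optional[Set[str]]:
    """Return the loaded model names, or None when the probe itself failed.

    fetch_ps is a hypothetical stand-in for the HTTP request to
    <endpoint>/api/ps; it returns the decoded JSON or raises on errors.
    """
    try:
        payload = fetch_ps()
    except Exception as exc:
        print(f"[probe_loaded_models] /api/ps failed: {exc}")
        return None  # unknown state, distinct from "no models loaded"
    return {entry["name"] for entry in payload.get("models", [])}


def failing_fetch() -> dict:
    # Simulates the 502 from the log above.
    raise RuntimeError("502 Bad Gateway")


healthy = probe_loaded_models(lambda: {"models": [{"name": "llama3:8b"}]})
idle = probe_loaded_models(lambda: {"models": []})
broken = probe_loaded_models(failing_fetch)
```

With this shape, a caller can tell an idle endpoint (empty set) apart from one whose state is unknown (None) and decide separately how to treat each.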

3. Error is cached for 300 seconds

At router.py:1094-1097, the _loaded_error_cache caches the failure for 300s:

async with _loaded_error_cache_lock:
    if endpoint in _loaded_error_cache:
        if _is_fresh(_loaded_error_cache[endpoint], 300):
            return set()

During this window, loaded_models() returns set() without making any HTTP request.
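The effect of the error cache can be reproduced with a fake clock. The sketch below is a simplified, synchronous model of the behavior described above, not the real implementation: is_fresh is a hypothetical stand-in for _is_fresh (its real signature is not shown in this thread), and LoadedModelsProbe, fetch and the model name are made up for the demonstration.

```python
from typing import Callable, Dict, Set

ERROR_TTL = 300.0  # seconds, matching the TTL quoted above


def is_fresh(cached_at: float, ttl: float, now: float) -> bool:
    """Hypothetical stand-in for _is_fresh: True while the entry is younger than ttl."""
    return now - cached_at < ttl


class LoadedModelsProbe:
    def __init__(self, fetch: Callable[[], Set[str]], clock: Callable[[], float]):
        self.fetch = fetch
        self.clock = clock
        self.error_at: Dict[str, float] = {}  # endpoint -> time of last failure
        self.requests = 0  # number of probes actually sent

    def loaded_models(self, endpoint: str) -> Set[str]:
        now = self.clock()
        if endpoint in self.error_at and is_fresh(self.error_at[endpoint], ERROR_TTL, now):
            return set()  # short-circuit: no request is made
        self.requests += 1
        try:
            models = self.fetch()
        except Exception:
            self.error_at[endpoint] = now  # remember the failure
            return set()
        self.error_at.pop(endpoint, None)
        return models


now = [0.0]
backend_up = [False]


def fetch() -> Set[str]:
    if not backend_up[0]:
        raise RuntimeError("502 Bad Gateway")
    return {"llama3:8b"}


probe = LoadedModelsProbe(fetch, clock=lambda: now[0])
ep = "http://ollama:8383"

first = probe.loaded_models(ep)       # t=0: request fails, failure is cached
backend_up[0] = True                  # /api/ps recovers
now[0] = 10.0
during_ttl = probe.loaded_models(ep)  # still set(), and no request is sent
now[0] = 301.0
after_ttl = probe.loaded_models(ep)   # cache entry expired, request succeeds
```

Only two requests are sent in this run: the recovered endpoint keeps reporting an empty set until the cached failure ages out.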

4. Routing algorithm is poisoned

In choose_endpoint() step 3 (router.py:1901-1904):

loaded_and_free = [
    ep for ep, models in zip(candidate_endpoints, loaded_sets)
    if model in models and tracking_usage(ep) < get_max_connections(ep)
]

Since loaded_sets contains set() for the failed endpoint, model in models is always False for that endpoint. If no other endpoint has the model loaded, nothing passes the "loaded AND free" filter.
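The filter can be exercised with toy data. In this sketch the endpoint URLs, model names and connection limit are made up, and the usage dict replaces tracking_usage() and get_max_connections() so the example is self-contained.

```python
from typing import Dict, List, Set


def loaded_and_free(
    model: str,
    candidate_endpoints: List[str],
    loaded_sets: List[Set[str]],
    usage: Dict[str, int],
    max_connections: int,
) -> List[str]:
    # Same shape as the step 3 comprehension quoted above.
    return [
        ep for ep, models in zip(candidate_endpoints, loaded_sets)
        if model in models and usage[ep] < max_connections
    ]


endpoints = ["http://ollama-a:8383", "http://ollama-b:8383"]
usage = {ep: 0 for ep in endpoints}

actual_state = [{"llama3:8b"}, {"mistral:7b"}]  # what is really loaded
after_failed_probe = [set(), {"mistral:7b"}]    # ollama-a's /api/ps failed

with_true_state = loaded_and_free("llama3:8b", endpoints, actual_state, usage, 2)
with_failed_probe = loaded_and_free("llama3:8b", endpoints, after_failed_probe, usage, 2)
```

With the true state the filter selects ollama-a; after the failed probe it selects nothing, even though the model is still loaded there.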

5. Fallback routes to unloaded endpoints

The algorithm falls through to step 4 (router.py:1924-1928):

endpoints_with_free_slot = [
    ep for ep in candidate_endpoints
    if tracking_usage(ep) < get_max_connections(ep)
]

This only checks for free slots, not whether the model is loaded. The router sends the request to an endpoint that doesn't have the model loaded → empty response.
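Steps 3 and 4 together can be sketched as follows. This is a reduced model for illustration only: route_candidates is a hypothetical name, the endpoints and models are made up, and because the thread does not show how the router picks among the remaining candidates, the sketch returns the candidate list instead of choosing one.

```python
from typing import Dict, List, Set, Tuple


def route_candidates(
    model: str,
    endpoints: List[str],
    loaded: Dict[str, Set[str]],
    usage: Dict[str, int],
    max_connections: int,
) -> Tuple[str, List[str]]:
    """Return which step matched and the endpoints it would consider."""
    free = [ep for ep in endpoints if usage[ep] < max_connections]
    # Step 3: endpoints that have the model loaded and a free slot.
    step3 = [ep for ep in free if model in loaded[ep]]
    if step3:
        return "loaded_and_free", step3
    # Step 4: any endpoint with a free slot, loaded or not.
    return "free_slot_only", free


endpoints = ["http://ollama-a:8383", "http://ollama-b:8383"]
usage = {ep: 0 for ep in endpoints}
# ollama-a really has llama3:8b loaded, but its /api/ps probe failed.
loaded = {"http://ollama-a:8383": set(), "http://ollama-b:8383": {"mistral:7b"}}

step, candidates = route_candidates("llama3:8b", endpoints, loaded, usage, 2)
```

Here the request falls through to the free-slot step, where both the failing endpoint and an endpoint without the model are considered equally valid targets.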

Summary

| Component | Queries | Tolerance |
|-----------|---------|-----------|
| /health | /api/version | Reports per-endpoint status |
| /api/config | /api/version | Reports per-endpoint status |
| loaded_models() | /api/ps | Silently returns set(), cached 300s |
| choose_endpoint() | depends on loaded_models() | Makes routing decisions on stale failure data |

The mismatch: Health check and dashboard query /api/version (which works), while routing depends on /api/ps (which fails). The 300s error cache in loaded_models() means the router can keep making incorrect routing decisions for up to 5 minutes after /api/ps recovers.
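As a sketch of how the mismatch could be made visible, a health report can be derived from both paths instead of /api/version alone. This is an assumption-laden illustration, not the project's health check: endpoint_health, the probe callable (URL in, HTTP status code out) and fake_probe are hypothetical.

```python
from typing import Callable, Dict


def endpoint_health(endpoint: str, probe: Callable[[str], int]) -> Dict[str, object]:
    """Report "ok" only if every probed path answers with HTTP 200.

    probe is a hypothetical callable that takes a URL and returns the
    HTTP status code of a GET request to it.
    """
    paths = ("/api/version", "/api/ps")
    checks = {path: probe(f"{endpoint}{path}") for path in paths}
    failed = sorted(path for path, code in checks.items() if code != 200)
    return {"status": "ok" if not failed else "error", "failed": failed}


def fake_probe(url: str) -> int:
    # Mirrors the situation in this issue: /api/version works, /api/ps returns 502.
    return 502 if url.endswith("/api/ps") else 200


report = endpoint_health("http://ollama:8383", fake_probe)
```

In the scenario from this issue, such a report would show the endpoint as failing on /api/ps rather than "ok".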

Key Files

  • router.py:1063-1119: fetch.loaded_models() with error cache
  • router.py:1094-1097: _loaded_error_cache short-circuit (300s TTL)
  • router.py:1901-1904: Step 3, "loaded AND free" filter
  • router.py:1924-1928: Step 4 fallback, free slot only (no loaded check)
  • router.py:4025: Health check uses /api/version (not /api/ps)

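opencode session: https://opencode.ai/s/N2vI0wHY  |  forgejo run: https://bitfreedom.net/code/nomyo-ai/nomyo-router/actions/runs/185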

Author

Thank you for the quick response. This explains the behavior, can this be fixed in a new release?

Owner

WIP

alpha-nerd added the bug label 2026-05-18 13:41:31 +02:00
Owner

The oc bot was almost right about the root cause.
A fix is being prepared and will be committed shortly.

alpha-nerd referenced this issue from a commit 2026-05-18 13:45:16 +02:00
Owner

@JTHesse do you want to check out PR #85?
Policy requires that the PR branch run successfully in our staging environment before it can be merged.

Owner

latest docker images have been released

Owner

v0.9.1 docker images have been released

Author

Thank you, works now!

Reference: nomyo-ai/nomyo-router#83