Bug: Endpoint config error status not acknowledged

JTHesse commented

2026-05-18 10:56:42 +02:00

Hi there, I observed that a few completion calls returned empty responses. Looking at the nomyo dashboard, I noticed that one endpoint showed an error status.
The log shows:

INFO:     192.168.4.9:54078 - "GET /api/tags HTTP/1.1" 200 OK
[fetch.endpoint_details] Error while contacting http://ollama:8383/api/ps: 502: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.27.5</center>
</body>
</html>

This is expected. However, calling the health function for the same endpoint shows a different status.

{
  "status": "ok",
  "endpoints": {
    "http://ollama:8383": {
      "status": "ok",
      "version": []
    }
  }
}

Therefore, I assume that the observed error status is not acknowledged by the health endpoint and can also lead to dead-end completion calls. Unfortunately, this only occurred on our production deployment, so I wasn't able to investigate the issue further.


alpha-nerd commented

2026-05-18 12:46:55 +02:00

Owner

/oc investigate root cause

alpha-nerd self-assigned this 2026-05-18 12:47:05 +02:00

alpha-nerd commented

2026-05-18 12:47:18 +02:00

Owner

I now have a complete understanding of the root cause. Here's the analysis:

Root Cause Analysis

The Bug Flow

1. Health check and routing query different endpoints

/health (line 4025) queries /api/version → succeeds → reports "status": "ok"
Dashboard /api/config (line 3173) queries /api/version → succeeds → reports "status": "ok"
fetch.loaded_models() (line 1024) queries /api/ps → 502 Bad Gateway → logs the error

2. loaded_models() silently returns empty set on failure

At router.py:1036-1039:

except Exception as e:
    # If anything goes wrong we simply assume the endpoint has no models
    message = _format_connection_issue(f"{endpoint}/api/ps", e)
    print(f"[fetch.loaded_models] {message}")
    return set()

When /api/ps fails, loaded_models() returns set() — an empty set meaning "no models loaded".
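One way to keep "probe failed" apart from "no models loaded" is to return a distinct value for the failure case. The following is only a minimal sketch under assumptions, not the router's actual code: `loaded_models_or_none`, the injected `fetch_ps` callable, and the two fake fetchers are hypothetical names used for illustration.

```python
from typing import Callable, Optional, Set


def loaded_models_or_none(
    endpoint: str, fetch_ps: Callable[[str], Set[str]]
) -> Optional[Set[str]]:
    """Return the loaded model names, or None when the /api/ps probe failed.

    None means "state unknown"; an empty set means "reachable, nothing loaded".
    """
    try:
        return fetch_ps(f"{endpoint}/api/ps")
    except Exception as exc:
        # Log the failure, but do not pretend the endpoint has no models.
        print(f"[sketch.loaded_models] {endpoint}/api/ps failed: {exc}")
        return None


# Hypothetical stand-ins for the real HTTP call.
def fake_ok(url: str) -> Set[str]:
    return {"llama3:8b"}


def fake_empty(url: str) -> Set[str]:
    return set()


def fake_bad_gateway(url: str) -> Set[str]:
    raise RuntimeError("502 Bad Gateway")
```

With this shape, callers can treat `None` differently from `set()` instead of collapsing both into "no models".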

3. Error is cached for 300 seconds

At router.py:1094-1097, the _loaded_error_cache caches the failure for 300s:

async with _loaded_error_cache_lock:
    if endpoint in _loaded_error_cache:
        if _is_fresh(_loaded_error_cache[endpoint], 300):
            return set()

During this window, loaded_models() returns set() without making any HTTP request.
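The effect of the error cache can be illustrated with a small sketch. The 300-second TTL is taken from the analysis above; `is_fresh`, `error_cache`, and `should_skip_probe` are hypothetical names and may not match how `_is_fresh` or `_loaded_error_cache` are actually implemented in router.py.

```python
import time
from typing import Dict, Optional

ERROR_TTL_SECONDS = 300  # TTL quoted in the analysis above


def is_fresh(cached_at: float, ttl: float, now: Optional[float] = None) -> bool:
    """True while fewer than `ttl` seconds have passed since `cached_at`."""
    now = time.monotonic() if now is None else now
    return (now - cached_at) < ttl


# Maps endpoint URL -> time at which the last failure was recorded.
error_cache: Dict[str, float] = {}


def should_skip_probe(endpoint: str, now: float) -> bool:
    """True if a recent failure means no HTTP request would be made."""
    cached_at = error_cache.get(endpoint)
    return cached_at is not None and is_fresh(cached_at, ERROR_TTL_SECONDS, now)


# A failure recorded at t=1000 suppresses probes until t=1300.
error_cache["http://ollama:8383"] = 1000.0
```

In this sketch, an endpoint that recovers at t=1010 would still be skipped for another 290 seconds.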

4. Routing algorithm is poisoned

In choose_endpoint() step 3 (router.py:1901-1904):

loaded_and_free = [
    ep for ep, models in zip(candidate_endpoints, loaded_sets)
    if model in models and tracking_usage(ep) < get_max_connections(ep)
]

Since loaded_sets contains set() for the failed endpoint, model in models is always False for that endpoint. If no other candidate has the model loaded, no endpoint passes the "loaded AND free" filter.
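The filter's behaviour with an empty set can be reproduced in isolation. This is a sketch with made-up data: the second endpoint, the model names, and the stub versions of `tracking_usage` and `get_max_connections` are assumptions for illustration only.

```python
from typing import List, Set

# First endpoint is the one whose /api/ps probe failed; second is hypothetical.
candidate_endpoints: List[str] = ["http://ollama:8383", "http://ollama-b:8383"]
loaded_sets: List[Set[str]] = [set(), {"mistral:7b"}]
model = "llama3:8b"


def tracking_usage(ep: str) -> int:
    """Stub: pretend no endpoint has active connections."""
    return 0


def get_max_connections(ep: str) -> int:
    """Stub: pretend every endpoint allows four connections."""
    return 4


# Same comprehension as in the excerpt above.
loaded_and_free = [
    ep for ep, models in zip(candidate_endpoints, loaded_sets)
    if model in models and tracking_usage(ep) < get_max_connections(ep)
]
```

Here `loaded_and_free` ends up empty even though both endpoints have free slots, because the only endpoint that may hold the model reports an empty set.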

5. Fallback routes to unloaded endpoints

The algorithm falls through to step 4 (router.py:1924-1928):

endpoints_with_free_slot = [
    ep for ep in candidate_endpoints
    if tracking_usage(ep) < get_max_connections(ep)
]

This only checks for free slots, not whether the model is loaded. The router sends the request to an endpoint that doesn't have the model loaded → empty response.
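One possible mitigation, sketched under the assumption that a failed probe is represented as `None` rather than `set()`, is to exclude endpoints with unknown state from both the primary filter and the fallback. `pick_candidates` and its parameters are hypothetical and do not mirror the real `choose_endpoint()` signature.

```python
from typing import Dict, List, Optional, Set


def pick_candidates(
    model: str,
    loaded: Dict[str, Optional[Set[str]]],
    usage: Dict[str, int],
    max_connections: int,
) -> List[str]:
    """Prefer endpoints with the model loaded; never pick unknown ones."""

    def has_free_slot(ep: str) -> bool:
        return usage.get(ep, 0) < max_connections

    # Primary choice: model is known to be loaded and a slot is free.
    loaded_and_free = [
        ep for ep, models in loaded.items()
        if models is not None and model in models and has_free_slot(ep)
    ]
    if loaded_and_free:
        return loaded_and_free

    # Fallback: free slot, but only where the /api/ps probe succeeded.
    return [
        ep for ep, models in loaded.items()
        if models is not None and has_free_slot(ep)
    ]
```

Whether skipping such endpoints or retrying the probe is the better behaviour depends on the deployment, so this is a design option rather than a recommendation.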

Summary

| Component | Queries | Behavior |
|-----------|---------|----------|
| `/health` | `/api/version` | Reports per-endpoint status |
| `/api/config` | `/api/version` | Reports per-endpoint status |
| `loaded_models()` | `/api/ps` | Silently returns `set()`, cached 300s |
| `choose_endpoint()` | depends on `loaded_models()` | Makes routing decisions on stale failure data |

The mismatch: The health check and dashboard query /api/version (which works), while routing depends on /api/ps (which fails). Because of the 300s error cache in loaded_models(), the router can keep making incorrect routing decisions for up to 5 minutes after /api/ps recovers.
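To make the mismatch concrete, here is a sketch of a health status that only reports "ok" when both paths respond successfully. It is an assumption about how such a check could look, not the existing /health implementation; `endpoint_status`, the injected `probe` callable, and `fake_probe` are hypothetical.

```python
from typing import Callable, Dict


def endpoint_status(endpoint: str, probe: Callable[[str], int]) -> Dict[str, object]:
    """Probe both paths and report an error if either one fails.

    `probe` takes a URL and returns the HTTP status code.
    """
    checks: Dict[str, str] = {}
    for path in ("/api/version", "/api/ps"):
        try:
            code = probe(endpoint + path)
            checks[path] = "ok" if 200 <= code < 300 else f"error: HTTP {code}"
        except Exception as exc:
            checks[path] = f"error: {exc}"
    status = "ok" if all(v == "ok" for v in checks.values()) else "error"
    return {"status": status, "checks": checks}


def fake_probe(url: str) -> int:
    """Stand-in for the situation in this issue: /api/ps returns 502."""
    return 502 if url.endswith("/api/ps") else 200
```

With `fake_probe`, the endpoint is reported as "error" even though /api/version still answers, which matches what the dashboard showed.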

Key Files

router.py:1063-1119 — fetch.loaded_models() with error cache
router.py:1094-1097 — _loaded_error_cache short-circuit (300s TTL)
router.py:1901-1904 — Step 3: "loaded AND free" filter
router.py:1924-1928 — Step 4 fallback: free slot only (no loaded check)
router.py:4025 — Health check uses /api/version (not /api/ps)

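opencode session: https://opencode.ai/s/N2vI0wHY | forgejo run: https://bitfreedom.net/code/nomyo-ai/nomyo-router/actions/runs/185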


👍 1

JTHesse commented

2026-05-18 13:08:24 +02:00

Author

Thank you for the quick response. This explains the behavior. Can this be fixed in a new release?