Refactor endpoint selection logic to consistently use tracking model keys (normalized via `get_tracking_model`) instead of raw model names, so usage counts are compared against the same keys under which the increment/decrement operations store them. This fixes inconsistent load balancing and model affinity behavior caused by mismatches between raw and tracked model identifiers.
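The comparison described above can be sketched as follows; the function name `select_endpoint` and the `(endpoint, tracking_key)` usage layout are illustrative assumptions, not the router's actual data model.

```python
def select_endpoint(endpoints, usage, tracking_key):
    """Pick the endpoint with the fewest tracked in-flight requests (sketch)."""
    # `usage` is keyed by (endpoint, tracking_key), mirroring how the
    # increment/decrement bookkeeping stores counts.
    return min(endpoints, key=lambda ep: usage.get((ep, tracking_key), 0))

usage = {
    ("http://a:11434", "llama3:latest"): 3,
    ("http://b:11434", "llama3:latest"): 1,
}
chosen = select_endpoint(["http://a:11434", "http://b:11434"], usage, "llama3:latest")
```

The key point is that the lookup uses the normalized tracking key, never the raw model name from the request.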
Introduce `get_tracking_model()` to standardize model names for consistent usage tracking in Prometheus metrics. This ensures llama-server models are stripped of HF prefixes and quantization suffixes, Ollama models append `:latest` when versionless, and external OpenAI models remain unchanged, aligning all tracking keys with the PS table.
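A minimal sketch of the normalization rules listed above; the exact prefix/suffix patterns and the `backend` discriminator are assumptions for illustration, not the router's real signature.

```python
import re

def get_tracking_model(name: str, backend: str) -> str:
    """Normalize a model name into the key used for usage tracking (sketch)."""
    if backend == "llama_server":
        # Strip an HF-style "org/" prefix and a quantization suffix like "-Q4_K_M".
        name = name.split("/")[-1]
        name = re.sub(r"[-.](Q\d_[A-Z0-9_]+|F16|F32)(\.gguf)?$", "", name,
                      flags=re.IGNORECASE)
        return name
    if backend == "ollama":
        # Versionless Ollama names implicitly mean ":latest".
        return name if ":" in name else f"{name}:latest"
    # External OpenAI models pass through unchanged.
    return name
```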
Filter out non-string version responses (e.g., empty lists from failed requests) and return a 503 Service Unavailable error if no valid versions are received from any endpoint.
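The filtering step might look like this; `pick_valid_versions` is a hypothetical helper name, and the real router would raise an HTTP 503 rather than a plain `RuntimeError`.

```python
def pick_valid_versions(responses):
    """Drop non-string payloads (e.g. an empty list from a failed request)."""
    valid = [v for v in responses if isinstance(v, str) and v]
    if not valid:
        # In the router this becomes an HTTP 503 Service Unavailable response.
        raise RuntimeError("503: no valid version received from any endpoint")
    return valid
```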
- Fetch `/props` endpoints in parallel to get context length and auto-unload sleeping models
- Add support for reasoning content and tool calls in streaming OpenAI chat/completions responses
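The parallel fetch can be sketched with `asyncio.gather`; `fetch_props` here returns canned data in place of a real HTTP GET of `/props`, and the response shape is an assumption.

```python
import asyncio

async def fetch_props(endpoint: str) -> dict:
    # Placeholder for an HTTP GET of f"{endpoint}/props"; returns canned data here.
    await asyncio.sleep(0)
    return {"endpoint": endpoint, "default_generation_settings": {"n_ctx": 8192}}

async def fetch_all_props(endpoints: list[str]) -> dict[str, dict]:
    # Fetch every /props endpoint concurrently instead of one after another;
    # return_exceptions=True keeps one failing endpoint from sinking the rest.
    results = await asyncio.gather(
        *(fetch_props(ep) for ep in endpoints), return_exceptions=True
    )
    return {ep: r for ep, r in zip(endpoints, results) if isinstance(r, dict)}

props = asyncio.run(fetch_all_props(["http://a:8080", "http://b:8080"]))
```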
Adds lock-protected dictionaries to track running background refresh tasks, preventing duplicate executions per endpoint. Increases cache freshness thresholds from 30s to 300s to reduce blocking behavior.
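A sketch of the lock-protected task registry, assuming an asyncio-based router; all names here are illustrative.

```python
import asyncio

_refresh_tasks: dict[str, asyncio.Task] = {}
_refresh_lock = asyncio.Lock()

async def _do_refresh(endpoint: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real HTTP refresh
    return f"refreshed {endpoint}"

async def schedule_refresh(endpoint: str) -> asyncio.Task:
    """Start at most one background refresh task per endpoint."""
    async with _refresh_lock:
        task = _refresh_tasks.get(endpoint)
        if task is None or task.done():
            task = asyncio.create_task(_do_refresh(endpoint))
            _refresh_tasks[endpoint] = task
        return task

async def main():
    t1 = await schedule_refresh("http://a:11434")
    t2 = await schedule_refresh("http://a:11434")  # duplicate call is coalesced
    same = t1 is t2
    return same, await t1

same_task, result = asyncio.run(main())
```

The lock guards the check-then-insert on the task dictionary, so two callers can never both miss and start duplicate refreshes.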
fix: /v1 endpoints now use correct media types and report usage information, with proper logging
- updated reasoning handling
- improved model and error caches
- fixed OpenAI tool calling, including Ollama translations
- direct support for llama.cpp's llama_server via llama_server_endpoint config
- basic llama_server model info in dashboard
- improved endpoint info fetching behaviour in error cases
- Increased error and loaded model cache freshness thresholds from 10s to 30s.
- Added `skip_error_cache` parameter to `endpoint_details` to prevent cached failures from blocking health checks.
- Implemented automatic error recording in `_available_error_cache` on API request failures.
Adds support for correctly handling tool calls in chat requests. Normalizes tool call data (ensuring IDs, types, and JSON arguments) in non-streaming mode and accumulates OpenAI-style deltas during streaming to build the final Ollama response.
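The non-streaming normalization might look like this sketch; field names follow the OpenAI tool-call shape, and the generated-id format is an assumption.

```python
import json
import uuid

def normalize_tool_call(tc: dict) -> dict:
    """Ensure a tool call has an id, a type, and JSON-string arguments."""
    fn = dict(tc.get("function", {}))
    args = fn.get("arguments", "{}")
    if not isinstance(args, str):          # Ollama may hand arguments back as a dict
        args = json.dumps(args)
    fn["arguments"] = args
    return {
        "id": tc.get("id") or f"call_{uuid.uuid4().hex[:8]}",
        "type": tc.get("type") or "function",
        "function": fn,
    }
```

In streaming mode the same shape is built incrementally by accumulating each delta's `arguments` fragment before the final response is emitted.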
Add `llama_server_endpoints` configuration field to support llama_server OpenAI-compatible endpoints for status checks. Implement helper functions to parse model names and quantization levels from llama-server responses (best effort). Update `is_ext_openai_endpoint` to properly distinguish these endpoints from external OpenAI services. Update sample configuration documentation.
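A best-effort parser along the lines described; the regex and the `.gguf` handling are assumptions modeled on common llama-server model ids.

```python
import re

# Best-effort match for common quantization tags (Q4_K_M, Q8_0, F16, ...).
QUANT_RE = re.compile(r"(Q\d(?:_[A-Z0-9]+)*|F16|F32|BF16)", re.IGNORECASE)

def parse_llama_server_model(model_id: str):
    """Split a llama-server model id into (name, quantization), best effort."""
    base = model_id.rsplit("/", 1)[-1]
    if base.lower().endswith(".gguf"):
        base = base[:-5]
    m = QUANT_RE.search(base)
    quant = m.group(1) if m else None
    if quant:
        base = base.replace(quant, "").strip("-._ ")
    return base, quant
```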
Refactor error tracking to use separate caches for 'available' and 'loaded' models, preventing cross-contamination of transient errors. Implement background refresh for available models to avoid blocking requests, and use stale-while-revalidate: cache entries between 300s and 600s old are served immediately while a background refresh runs.
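The age thresholds translate into a small classification helper; the tier names here are illustrative.

```python
FRESH_TTL = 300.0   # below this age: serve the cache entry directly
STALE_TTL = 600.0   # up to this age: serve stale while revalidating

def classify_cache_entry(age_seconds: float) -> str:
    """Map a cache entry's age onto the serving strategy."""
    if age_seconds < FRESH_TTL:
        return "fresh"
    if age_seconds < STALE_TTL:
        return "stale-while-revalidate"  # serve immediately, refresh in background
    return "expired"                     # block and refetch
```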
- Added proper API key validation in router.py with 401 response when key is missing
- Implemented CORS headers for authentication requests
- Updated table header from "Until" to "Unload" in static/index.html
- Improved security by preventing API key leakage in access logs
- Renamed Feedback class to follow PascalCase convention
- Fixed candidate enumeration start index from 0 to 1
- Simplified candidate content access by removing .message.content
- Updated CONFIG_PATH environment variable name to CONFIG_PATH_ARG
- Bumped version from 0.5 to 0.6
- Removed unnecessary return statement and trailing newline
Added in-flight request tracking mechanism to prevent cache stampede when multiple concurrent requests arrive for the same endpoint. This introduces new dictionaries to track ongoing requests and a lock to coordinate access. The available_models method was refactored to use an internal helper function and includes request coalescing logic to ensure only one HTTP request is made per endpoint when cache entries expire. The loaded_models method was also updated to use the new caching and coalescing pattern.
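The coalescing pattern can be sketched as follows; `available_models` here is reduced to the in-flight bookkeeping, with the HTTP call simulated and counted.

```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}
_inflight_lock = asyncio.Lock()
http_calls = 0

async def _fetch_models(endpoint: str) -> list[str]:
    global http_calls
    http_calls += 1                 # count simulated HTTP requests
    await asyncio.sleep(0.01)
    return ["llama3:latest"]

async def available_models(endpoint: str) -> list[str]:
    """Coalesce concurrent cache refills into one request per endpoint."""
    async with _inflight_lock:
        fut = _inflight.get(endpoint)
        if fut is None:
            fut = asyncio.ensure_future(_fetch_models(endpoint))
            _inflight[endpoint] = fut
    try:
        return await fut
    finally:
        async with _inflight_lock:
            _inflight.pop(endpoint, None)

async def main():
    # Five concurrent callers trigger exactly one underlying request.
    return await asyncio.gather(*(available_models("http://a:11434") for _ in range(5)))

results = asyncio.run(main())
```

Everyone who arrives while a fetch is in flight awaits the same future instead of issuing a new request, which is what prevents the stampede.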
- Reduced min-width of model columns from 340px to 200px with max-width of 300px
- Added specific styling for narrow columns (3rd-5th) with fixed width and center alignment
- Removed "Instance count" as it has redundant information
- Enhanced time formatting logic to show relative time instead of absolute dates
- Simplified digest display to show last 6 characters instead of truncated format
- Added proper handling for various time value types (number, string, null)
Added endpoint differentiation to the models PS board, showing where each model is loaded and for how long; this makes it easier to inspect multiple instances of the same model deployed for load balancing.
- Added validation to check that the extracted key is not empty before returning it
- Added CORS headers to enforce_router_api_key() so cross-origin requests are handled properly and CORS-related errors are prevented
Create atomic snapshots by deep copying usage data structures to prevent race conditions.
Protect concurrent reads of usage counts with explicit locking in endpoint selection.
Replace README screenshot with a video link.
Optional router-level API key that gates router/API/web UI access (leave empty to disable)
## Supplying the router API key
If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:
- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`
Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```