- added global for orphaned token_worker_task and flust_task
- fixed a regex to effectively _mask_secrets
- fixed several Type and KeyErrors
- fixed model deduplication for llama_server_endpoints
Add check for `status is None` in `_is_llama_model_loaded`.
Models without a status field (e.g., single-model servers) are
assumed to be always loaded rather than failing the check.
Also updated docstring to clarify this behavior.
Refactor endpoint selection logic to consistently use tracking model keys (normalized via `get_tracking_model`) instead of raw model names, ensuring usage counts are accurately compared with how increment/decrement operations store them. This fixes inconsistent load balancing and model affinity behavior caused by mismatches between raw and tracked model identifiers.
Introduce `get_tracking_model()` to standardize model names for consistent usage tracking in Prometheus metrics. This ensures llama-server models are stripped of HF prefixes and quantization suffixes, Ollama models append `:latest` when versionless, and external OpenAI models remain unchanged—aligning all tracking keys with the PS table.
Filter out non-string version responses (e.g., empty lists from failed requests) and return a 503 Service Unavailable error if no valid versions are received from any endpoint.
- Fetch `/props` endpoints in parallel to get context length and auto-unload sleeping models
- Add support for reasoning content and tool calls in streaming openai chat/completions responses
Adds lock-protected dictionaries to track running background refresh tasks, preventing duplicate executions per endpoint. Increases cache freshness thresholds from 30s to 300s to reduce blocking behavior.
fix: /v1 endpoints use correct media_types and usage information with proper logging
- updated reasoning handling
- improved model and error caches
- fixed openai tool calling incl. ollama translations
- direct support for llama.cpp's llama_server via llama_server_endpoint config
- basic llama_server model info in dashboard
- improved endpoint info fetching behaviour in error cases
- Increased error and loaded model cache freshness thresholds from 10s to 30s.
- Added `skip_error_cache` parameter to `endpoint_details` to prevent cached failures from blocking health checks.
- Implemented automatic error recording in `_available_error_cache` on API request failures.
Adds support for correctly handling tool calls in chat requests. Normalizes tool call data (ensuring IDs, types, and JSON arguments) in non-streaming mode and accumulates OpenAI-style deltas during streaming to build the final Ollama response.
Add `llama_server_endpoints` configuration field to support llama_server OpenAI-compatible endpoints for status checks. Implement helper functions to parse model names and quantization levels from llama-server responses (best effort). Update `is_ext_openai_endpoint` to properly distinguish these endpoints from external OpenAI services. Update sample configuration documentation.
Refactor error tracking to use separate caches for 'available' and 'loaded' models, preventing cross-contamination of transient errors. Implement background refresh for available models to prevent blocking requests, and use stale-while-revalidate (300-600s) to serve stale data immediately when the cache is between 300s and 600s old.
- Added proper API key validation in router.py with 401 response when key is missing
- Implemented CORS headers for authentication requests
- Updated table header from "Until" to "Unload" in static/index.html
- Improved security by preventing API key leakage in access logs
- Renamed Feedback class to follow PascalCase convention
- Fixed candidate enumeration start index from 0 to 1
- Simplified candidate content access by removing .message.content
- Updated CONFIG_PATH environment variable name to CONFIG_PATH_ARG
- Bumped version from 0.5 to 0.6
- Removed unnecessary return statement and trailing newline
Added in-flight request tracking mechanism to prevent cache stampede when multiple concurrent requests arrive for the same endpoint. This introduces new dictionaries to track ongoing requests and a lock to coordinate access. The available_models method was refactored to use an internal helper function and includes request coalescing logic to ensure only one HTTP request is made per endpoint when cache entries expire. The loaded_models method was also updated to use the new caching and coalescing pattern.
- Reduced min-width of model columns from 340px to 200px with max-width of 300px
- Added specific styling for narrow columns (3rd-5th) with fixed width and center alignment
- Removed "Instance count" as it has redundant information
- Enhanced time formatting logic to show relative time instead of absolute dates
- Simplified digest display to show last 6 characters instead of truncated format
- Added proper handling for various time value types (number, string, null)