Extract text parts from multimodal payloads (lists/dicts).
Skip image_url and other non-text types to ensure embedding
models receive compatible text-only input.
- added global for orphaned token_worker_task and flust_task
- fixed a regex to effectively _mask_secrets
- fixed several Type and KeyErrors
- fixed model deduplication for llama_server_endpoints
Add check for `status is None` in `_is_llama_model_loaded`.
Models without a status field (e.g., single-model servers) are
assumed to be always loaded rather than failing the check.
Also updated docstring to clarify this behavior.
Refactor endpoint selection logic to consistently use tracking model keys (normalized via `get_tracking_model`) instead of raw model names, ensuring usage counts are accurately compared with how increment/decrement operations store them. This fixes inconsistent load balancing and model affinity behavior caused by mismatches between raw and tracked model identifiers.
Introduce `get_tracking_model()` to standardize model names for consistent usage tracking in Prometheus metrics. This ensures llama-server models are stripped of HF prefixes and quantization suffixes, Ollama models append `:latest` when versionless, and external OpenAI models remain unchanged—aligning all tracking keys with the PS table.
Filter out non-string version responses (e.g., empty lists from failed requests) and return a 503 Service Unavailable error if no valid versions are received from any endpoint.
- Fetch `/props` endpoints in parallel to get context length and auto-unload sleeping models
- Add support for reasoning content and tool calls in streaming openai chat/completions responses
Adds lock-protected dictionaries to track running background refresh tasks, preventing duplicate executions per endpoint. Increases cache freshness thresholds from 30s to 300s to reduce blocking behavior.
fix: /v1 endpoints use correct media_types and usage information with proper logging
- updated reasoning handling
- improved model and error caches
- fixed openai tool calling incl. ollama translations
- direct support for llama.cpp's llama_server via llama_server_endpoint config
- basic llama_server model info in dashboard
- improved endpoint info fetching behaviour in error cases