Commit graph

217 commits

Author SHA1 Message Date
e416542bf8 fix: model name normalization for context_cache preemptive context-shifting for smaller context windows with previous failure 2026-03-12 16:08:01 +01:00
be60a348e1 fix: changing error_cache to stale-while-revalidate same as available_models_cache 2026-03-12 14:47:54 +01:00
9acc37951a feat: add reactive auto context-shift in openai endpoints to recover from out-of-context errors 2026-03-12 10:15:52 +01:00
95c643109a feat: add an openai retry if a request with an image is sent to a text-only model 2026-03-12 10:06:18 +01:00
1ae989788b fix(router): normalize multimodal input to extract text for embeddings
Extract text parts from multimodal payloads (lists/dicts).
Skip image_url and other non-text types to ensure embedding
models receive compatible text-only input.
2026-03-11 16:41:21 +01:00
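
The text extraction described in 1ae989788b could look roughly like the sketch below. The function name `extract_text` and the part layout (`{"type": "text", "text": ...}`) are assumptions based on the OpenAI-style multimodal format, not the router's actual code.

```python
from typing import Any


def extract_text(payload: Any) -> str:
    """Collect text from a string, a single dict part, or a list of parts.

    Hypothetical helper: non-text parts such as image_url are skipped.
    """
    if isinstance(payload, str):
        return payload
    if isinstance(payload, dict):
        # Keep only parts explicitly typed as text.
        if payload.get("type") == "text" and isinstance(payload.get("text"), str):
            return payload["text"]
        return ""
    if isinstance(payload, list):
        parts = [extract_text(item) for item in payload]
        return " ".join(p for p in parts if p)
    return ""
```

Joining the text parts with a single space is an arbitrary choice here; a newline would work equally well for most embedding models.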
46da392a53 fix: semcache version pinned 2026-03-11 09:40:00 +01:00
fbdc73eebb fix: improvements, fixes and opt-in cache
doc: semantic-cache.md added with detailed write-up
2026-03-10 15:19:37 +01:00
a5108486e3 conf: clean default conf 2026-03-08 09:35:40 +01:00
e8b8981421 doc: updated usage.md 2026-03-08 09:26:53 +01:00
dd4b12da6a feat: adding a semantic cache layer 2026-03-08 09:12:09 +01:00
c3d47c7ffe docs: adding ghcr docker pull instructions 2026-03-05 11:54:42 +01:00
b951cc82e3 bump version 2026-03-05 11:09:20 +01:00
00a06dca51 feat: add docker publish workflow 2026-03-05 11:09:16 +01:00
8037706f0b fix(db.py): remove full table scans with proper where clauses for dashboard statistics and calc in db rather than python 2026-03-03 17:20:33 +01:00
45315790d1 fix(router.py):
- added global for orphaned token_worker_task and flush_task
- fixed a regex to effectively _mask_secrets
- fixed several Type and KeyErrors
- fixed model deduplication for llama_server_endpoints
2026-03-03 16:34:16 +01:00
e96e890511 refactor: make choose_endpoint use cache incrementer for atomic updates 2026-03-03 14:57:37 +01:00
e7196146ad feat: add uvloop to requirements.txt as optional dependency to improve performance in high concurrent scenarios 2026-03-03 10:31:10 +01:00
10c83c3e1e fix(router): treat missing status as loaded for llama model check
Add check for `status is None` in `_is_llama_model_loaded`.
Models without a status field (e.g., single-model servers) are
assumed to be always loaded rather than failing the check.
Also updated docstring to clarify this behavior.
2026-03-02 08:54:46 +01:00
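
A minimal sketch of the check described in 10c83c3e1e, assuming a model entry is a dict whose optional `status` field is either a plain string or a dict with a `value` key. The function name and the `"loaded"` literal are illustrative assumptions, not the real `_is_llama_model_loaded`.

```python
from typing import Any, Mapping


def is_model_loaded(model: Mapping[str, Any]) -> bool:
    """Return True if the model entry should count as loaded."""
    status = model.get("status")
    if status is None:
        # No status field (e.g. a single-model server): assume always loaded.
        return True
    if isinstance(status, Mapping):
        # Assumed shape: {"value": "loaded"}
        return status.get("value") == "loaded"
    return status == "loaded"
```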
cac0580eec feat: adding /v1/rerank endpoint with cohere,jina,llama.cpp compatibility 2026-02-28 09:31:25 +01:00
ad4a1d07b2 fix(/v1/embeddings): returning the async_gen forced FastAPI serialization, which caused Pydantic errors. Also sanitized nan/inf values to floats (0.0).
Use try/finally to properly decrement usage counters in case of error.
2026-02-27 16:39:27 +01:00
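
The two ideas in ad4a1d07b2 (replacing non-finite floats and decrementing a counter in `finally`) can be sketched as follows. The helper names and the plain-dict usage counter are assumptions for illustration only.

```python
import math
from typing import Callable, Dict, List


def sanitize_embedding(values: List[float]) -> List[float]:
    """Replace NaN and +/-inf with 0.0 so the vector serializes as valid JSON."""
    return [v if math.isfinite(v) else 0.0 for v in values]


def embed_with_tracking(
    compute: Callable[[], List[float]],
    usage: Dict[str, int],
    key: str,
) -> List[float]:
    """Run `compute` while holding a usage slot for `key`."""
    usage[key] = usage.get(key, 0) + 1
    try:
        return sanitize_embedding(compute())
    finally:
        # Runs on success and on error, so the counter never leaks.
        usage[key] -= 1
```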
d2ea65f74a fix(router): use normalized model keys for endpoint selection
Refactor endpoint selection logic to consistently use tracking model keys (normalized via `get_tracking_model`) instead of raw model names, ensuring usage counts are accurately compared with how increment/decrement operations store them. This fixes inconsistent load balancing and model affinity behavior caused by mismatches between raw and tracked model identifiers.
2026-02-19 17:32:54 +01:00
07751ddd3b fix: endpoint selection logic again 2026-02-19 10:11:53 +01:00
7cba67cce0 feat(router): normalize model names for usage tracking across endpoints (continued)
Introduce `get_tracking_model()` to standardize model names for consistent usage tracking in Prometheus metrics. This ensures llama-server models are stripped of HF prefixes and quantization suffixes, Ollama models append `:latest` when versionless, and external OpenAI models remain unchanged—aligning all tracking keys with the PS table.
2026-02-18 11:45:37 +01:00
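
A rough sketch of the normalization rules listed in 7cba67cce0. The signature, the `kind` values, and the quantization regex are assumptions; the real `get_tracking_model()` may differ in all three.

```python
import re

# Assumed pattern for trailing quantization tags such as -Q4_K_M, :Q8_0, -F16.
_QUANT_SUFFIX = re.compile(r"[-_.:](Q\d\w*|IQ\d\w*|F16|BF16|F32)$", re.IGNORECASE)


def tracking_model(name: str, kind: str) -> str:
    """Map a raw model name to the key used for usage tracking."""
    if kind == "llama_server":
        base = name.rsplit("/", 1)[-1]  # drop an "org/" style HF prefix
        if base.lower().endswith(".gguf"):
            base = base[: -len(".gguf")]
        return _QUANT_SUFFIX.sub("", base)
    if kind == "ollama":
        # Versionless Ollama names get the implicit default tag.
        return name if ":" in name else f"{name}:latest"
    # External OpenAI models are left unchanged.
    return name
```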
b2980a7d24 fix(router): handle invalid version responses with 503 error
Filter out non-string version responses (e.g., empty lists from failed requests) and return a 503 Service Unavailable error if no valid versions are received from any endpoint.
2026-02-17 15:56:09 +01:00
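
The filtering in b2980a7d24 might be sketched like this. `ServiceUnavailable` is a stand-in exception defined here so the example has no dependencies; in a FastAPI handler one would presumably raise `HTTPException(status_code=503)` instead.

```python
from typing import Any, Iterable, List


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response (hypothetical)."""

    status_code = 503


def valid_versions(responses: Iterable[Any]) -> List[str]:
    """Keep only non-empty string versions; fail if none remain."""
    versions = [r for r in responses if isinstance(r, str) and r]
    if not versions:
        raise ServiceUnavailable("no valid version received from any endpoint")
    return versions
```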
836c5f41ea fix(router): normalize model names for usage tracking across endpoints 2026-02-17 11:35:53 +01:00
372fe9fb72 feat(router): parallelize llama-server props fetch and add reasoning/tool call support
- Fetch `/props` endpoints in parallel to get context length and auto-unload sleeping models
- Add support for reasoning content and tool calls in streaming openai chat/completions responses
2026-02-15 17:05:35 +01:00
4d40048fd2 fix: loaded_models_cache timing restored 2026-02-15 12:15:36 +01:00
0bad604b02 feat: deduplicate background refresh tasks and extend cache TTL
Adds lock-protected dictionaries to track running background refresh tasks, preventing duplicate executions per endpoint. Increases cache freshness thresholds from 30s to 300s to reduce blocking behavior.

fix: /v1 endpoints use correct media_types and usage information with proper logging
2026-02-14 14:51:44 +01:00
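
One way to deduplicate background refresh tasks per endpoint, as 0bad604b02 describes, is a lock-protected dict of running tasks. The class and method names below are hypothetical, and the sketch assumes an asyncio-based router.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class RefreshRegistry:
    """Tracks at most one running refresh task per endpoint."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}
        self._lock = asyncio.Lock()

    async def schedule(
        self, endpoint: str, refresh: Callable[[str], Awaitable[None]]
    ) -> asyncio.Task:
        async with self._lock:
            task = self._tasks.get(endpoint)
            if task is not None and not task.done():
                # A refresh is already in flight: reuse it.
                return task
            task = asyncio.create_task(refresh(endpoint))
            self._tasks[endpoint] = task
            return task
```

The registry is meant to be created inside the running event loop, so the lock is bound to the loop that later uses it.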
c9ff384bb2 fix(router): /v1/models endpoint
Now shows all available models
2026-02-13 16:27:06 +01:00
4d80dc5e7c feat: adding logprobs to /v1/chat/completions 2026-02-13 14:43:10 +01:00
eda48562da feat(router): add logprob support in /api/chat
Add logprob support to the OpenAI-to-Ollama proxy by converting OpenAI logprob formats to Ollama types. Also update the ollama dependency.
2026-02-13 13:29:45 +01:00
07af6e2e36 fix: better sample config 2026-02-13 10:52:14 +01:00
9ef1b770ba Merge pull request #25 from nomyo-ai/dev-v0.6
- updated reasoning handling
- improved model and error caches
- fixed openai tool calling incl. ollama translations
- direct support for llama.cpp's llama_server via llama_server_endpoint config
- basic llama_server model info in dashboard
- improved endpoint info fetching behaviour in error cases
2026-02-13 10:34:42 +01:00
1b355d8435 Merge branch 'main' into dev-v0.6 2026-02-13 10:33:36 +01:00
c545f413a5 Merge pull request #23 from JTHesse/main
Fix for SSL verification and SQL Bug
2026-02-13 10:19:07 +01:00
08b77428b8 refactor(router): bump cache TTLs and skip error cache for health checks
- Increased error and loaded model cache freshness thresholds from 10s to 30s.
- Added `skip_error_cache` parameter to `endpoint_details` to prevent cached failures from blocking health checks.
- Implemented automatic error recording in `_available_error_cache` on API request failures.
2026-02-13 10:11:41 +01:00
f7ef413090 replays 3af166c8a4 to grant merge into main 2026-02-12 16:28:40 +01:00
b649dcd8d6 proposal: use global truststore ctx for all connections 2026-02-12 16:15:39 +01:00
5c4e1e81a6 Merge pull request #24 from nomyo-ai:dependabot/pip/pillow-12.1.1
Bump pillow from 11.3.0 to 12.1.1
2026-02-12 15:58:33 +01:00
dependabot[bot] 99f5a3bc91 Bump pillow from 11.3.0 to 12.1.1
Bumps [pillow](https://github.com/python-pillow/Pillow) from 11.3.0 to 12.1.1.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/11.3.0...12.1.1)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.1.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-02-11 17:43:19 +00:00
dd30ab9422 fix SSL: CERTIFICATE_VERIFY_FAILED 2026-02-11 13:47:11 +01:00
3af166c8a4 fix sqlite3.OperationalError: no such table: main.token_time_series 2026-02-11 13:46:37 +01:00
9875eb977a feat: Add tool call normalization and streaming delta accumulation
Adds support for correctly handling tool calls in chat requests. Normalizes tool call data (ensuring IDs, types, and JSON arguments) in non-streaming mode and accumulates OpenAI-style deltas during streaming to build the final Ollama response.
2026-02-10 20:21:46 +01:00
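
The streaming half of 9875eb977a can be illustrated with a small accumulator for OpenAI-style tool call deltas, which arrive keyed by `index` with the `arguments` string split across chunks. The function name, the fallback ID scheme, and the choice to parse arguments into a dict for the final response are assumptions.

```python
import json
from typing import Any, Dict, Iterable, List


def accumulate_tool_calls(deltas: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge streamed tool call fragments into complete tool calls."""
    slots: Dict[int, Dict[str, Any]] = {}
    for delta in deltas:
        slot = slots.setdefault(
            delta.get("index", 0), {"id": None, "name": "", "arguments": ""}
        )
        if delta.get("id"):
            slot["id"] = delta["id"]
        fn = delta.get("function") or {}
        if fn.get("name"):
            slot["name"] = fn["name"]
        # Argument JSON arrives as string fragments; concatenate in order.
        slot["arguments"] += fn.get("arguments") or ""

    calls = []
    for index in sorted(slots):
        slot = slots[index]
        try:
            arguments = json.loads(slot["arguments"]) if slot["arguments"].strip() else {}
        except json.JSONDecodeError:
            arguments = {}  # malformed fragments: fall back to empty arguments
        calls.append(
            {
                "id": slot["id"] or f"call_{index}",  # hypothetical fallback ID
                "type": "function",
                "function": {"name": slot["name"], "arguments": arguments},
            }
        )
    return calls
```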
4892998abc feat(router): Add llama-server endpoints support and model parsing
Add `llama_server_endpoints` configuration field to support llama_server OpenAI-compatible endpoints for status checks. Implement helper functions to parse model names and quantization levels from llama-server responses (best effort). Update `is_ext_openai_endpoint` to properly distinguish these endpoints from external OpenAI services. Update sample configuration documentation.
2026-02-10 16:46:51 +01:00
1f81e69ce1 refactor(router.py): correctly implement OpenAI tool_calls to Ollama format conversion 2026-02-09 11:04:14 +01:00
7deb088c6a refactor(cache): split error cache and add stale-while-revalidate
Refactor error tracking to use separate caches for 'available' and 'loaded' models, preventing cross-contamination of transient errors. Implement background refresh for available models to prevent blocking requests, and use stale-while-revalidate (300-600s) to serve stale data immediately when the cache is between 300s and 600s old.
2026-02-08 16:46:40 +01:00
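
The stale-while-revalidate window from 7deb088c6a reduces to a three-way decision on cache age. The sketch below uses the 300s and 600s thresholds from the commit message; the function name and return labels are illustrative assumptions.

```python
def cache_decision(age_s: float, fresh_for: float = 300.0, stale_until: float = 600.0) -> str:
    """Classify a cache entry by its age in seconds.

    "fresh"         -> serve from cache, no refresh
    "stale_refresh" -> serve stale data now, refresh in the background
    "expired"       -> block and fetch new data
    """
    if age_s < fresh_for:
        return "fresh"
    if age_s < stale_until:
        return "stale_refresh"
    return "expired"
```

Only the `"expired"` branch makes the caller wait, which is what keeps requests from blocking on slow endpoints.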
92cea1dead feat: update reasoning handling
Updated reasoning content handling in router.py to check for both "reasoning_content" and "reasoning" attributes.
2026-02-08 11:29:47 +01:00
bd0d210b2a feat: enforce api key authentication and update table header
- Added proper API key validation in router.py with 401 response when key is missing
- Implemented CORS headers for authentication requests
- Updated table header from "Until" to "Unload" in static/index.html
- Improved security by preventing API key leakage in access logs
2026-02-01 10:05:46 +01:00
b718d575b7 Merge pull request #22 from nomyo-ai/dev-v0.5.x
Dev v0.5.x
2026-01-30 18:18:42 +01:00
d80b29e4f2 feat: enhance code quality and documentation
- Renamed Feedback class to follow PascalCase convention
- Fixed candidate enumeration start index from 0 to 1
- Simplified candidate content access by removing .message.content
- Updated CONFIG_PATH environment variable name to CONFIG_PATH_ARG
- Bumped version from 0.5 to 0.6
- Removed unnecessary return statement and trailing newline
2026-01-29 19:59:08 +01:00