Commit graph

198 commits

Author SHA1 Message Date
ad4a1d07b2 fix(/v1/embeddings): returning the async_gen forced FastAPI serialization, which caused Pydantic errors. Also sanitized NaN/Inf values to floats (0.0).
Use try/finally to properly decrement usage counters in case of error.
2026-02-27 16:39:27 +01:00
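The fix above can be sketched as follows; `sanitize_floats` and `track_usage` are hypothetical names illustrating the NaN/Inf cleanup and the try/finally counter handling, not the project's actual code:

```python
import math

def sanitize_floats(values):
    """Replace NaN/Inf embedding values with 0.0 so JSON/Pydantic serialization succeeds."""
    return [float(v) if math.isfinite(v) else 0.0 for v in values]

def track_usage(counters, key, fn):
    """Increment a usage counter around a call; try/finally guarantees the decrement."""
    counters[key] = counters.get(key, 0) + 1
    try:
        return fn()
    finally:
        counters[key] -= 1  # runs even when fn() raises

print(sanitize_floats([1.5, float("nan"), float("inf"), -2.0]))  # → [1.5, 0.0, 0.0, -2.0]
```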
d2ea65f74a fix(router): use normalized model keys for endpoint selection
Refactor endpoint selection logic to consistently use tracking model keys (normalized via `get_tracking_model`) instead of raw model names, ensuring usage counts are accurately compared with how increment/decrement operations store them. This fixes inconsistent load balancing and model affinity behavior caused by mismatches between raw and tracked model identifiers.
2026-02-19 17:32:54 +01:00
07751ddd3b fix: endpoint selection logic again 2026-02-19 10:11:53 +01:00
7cba67cce0 feat(router): normalize model names for usage tracking across endpoints (continued)
Introduce `get_tracking_model()` to standardize model names for consistent usage tracking in Prometheus metrics. This ensures llama-server models are stripped of HF prefixes and quantization suffixes, Ollama models append `:latest` when versionless, and external OpenAI models remain unchanged—aligning all tracking keys with the PS table.
2026-02-18 11:45:37 +01:00
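The normalization rules described above might look roughly like this; the signature and the quantization-suffix regex are assumptions for illustration, not the project's actual implementation:

```python
import re

def get_tracking_model(name: str, kind: str) -> str:
    """Normalize a model name into a consistent usage-tracking key (illustrative)."""
    if kind == "llama-server":
        name = name.split("/")[-1]  # strip a HuggingFace org prefix like "bartowski/"
        # strip a quantization suffix like "-Q4_K_M" or "-F16"
        return re.sub(r"-(Q\d+_K_?[SML]?|Q\d+_\d+|F16|BF16)$", "", name, flags=re.IGNORECASE)
    if kind == "ollama":
        return name if ":" in name else f"{name}:latest"  # version-less names get ":latest"
    return name  # external OpenAI models stay unchanged
```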
b2980a7d24 fix(router): handle invalid version responses with 503 error
Filter out non-string version responses (e.g., empty lists from failed requests) and return a 503 Service Unavailable error if no valid versions are received from any endpoint.
2026-02-17 15:56:09 +01:00
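A minimal sketch of the filtering: non-string results (such as the empty list a failed request can yield) are dropped, and an empty remainder maps to a 503. Names are hypothetical:

```python
def filter_versions(responses):
    """Keep only non-empty string version responses."""
    return [v for v in responses if isinstance(v, str) and v]

responses = ["0.12.4", [], None, "0.12.4"]
versions = filter_versions(responses)
status = 200 if versions else 503  # 503 Service Unavailable when no endpoint answered validly
```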
836c5f41ea fix(router): normalize model names for usage tracking across endpoints 2026-02-17 11:35:53 +01:00
372fe9fb72 feat(router): parallelize llama-server props fetch and add reasoning/tool call support
- Fetch `/props` endpoints in parallel to get context length and auto-unload sleeping models
- Add support for reasoning content and tool calls in streaming openai chat/completions responses
2026-02-15 17:05:35 +01:00
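The parallel `/props` fetch can be sketched with `asyncio.gather`; the stub below stands in for the real HTTP call (a real implementation would presumably use aiohttp):

```python
import asyncio

async def fetch_props(endpoint: str) -> dict:
    # stand-in for a GET of f"{endpoint}/props"; real code would make an HTTP request
    await asyncio.sleep(0)
    return {"endpoint": endpoint, "n_ctx": 8192}

async def fetch_all_props(endpoints):
    """Query every llama-server's /props concurrently instead of sequentially."""
    results = await asyncio.gather(*(fetch_props(e) for e in endpoints), return_exceptions=True)
    return {e: r for e, r in zip(endpoints, results) if not isinstance(r, Exception)}

props = asyncio.run(fetch_all_props(["http://a:8080", "http://b:8080"]))
```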
4d40048fd2 fix: loaded_models_cache timing restored 2026-02-15 12:15:36 +01:00
0bad604b02 feat: deduplicate background refresh tasks and extend cache TTL
Adds lock-protected dictionaries to track running background refresh tasks, preventing duplicate executions per endpoint. Increases cache freshness thresholds from 30s to 300s to reduce blocking behavior.

fix: /v1 endpoints use correct media_types and usage information with proper logging
2026-02-14 14:51:44 +01:00
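The task deduplication might look like this sketch (names hypothetical): a lock-protected dict maps each endpoint to its running refresh task, and a new task is only created when none is in flight:

```python
import asyncio

_refresh_tasks: dict = {}
_refresh_lock = asyncio.Lock()

async def schedule_refresh(endpoint: str, refresh):
    """Start at most one background refresh task per endpoint."""
    async with _refresh_lock:
        task = _refresh_tasks.get(endpoint)
        if task is None or task.done():
            task = asyncio.create_task(refresh(endpoint))
            _refresh_tasks[endpoint] = task
        return task
```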
c9ff384bb2 fix(router): /v1/models endpoint
Now shows all available models
2026-02-13 16:27:06 +01:00
4d80dc5e7c feat: add logprobs to /v1/chat/completions 2026-02-13 14:43:10 +01:00
eda48562da feat(router): add logprob support in /api/chat
Add logprob support to the OpenAI-to-Ollama proxy by converting OpenAI logprob formats to Ollama types. Also update the ollama dependency.
2026-02-13 13:29:45 +01:00
07af6e2e36 fix: better sample config 2026-02-13 10:52:14 +01:00
9ef1b770ba
Merge pull request #25 from nomyo-ai/dev-v0.6
- updated reasoning handling
- improved model and error caches
- fixed openai tool calling incl. ollama translations
- direct support for llama.cpp's llama_server via llama_server_endpoint config
- basic llama_server model info in dashboard
- improved endpoint info fetching behaviour in error cases
2026-02-13 10:34:42 +01:00
1b355d8435
Merge branch 'main' into dev-v0.6 2026-02-13 10:33:36 +01:00
c545f413a5
Merge pull request #23 from JTHesse/main
Fix for SSL verification and SQL Bug
2026-02-13 10:19:07 +01:00
08b77428b8 refactor(router): bump cache TTLs and skip error cache for health checks
- Increased error and loaded model cache freshness thresholds from 10s to 30s.
- Added `skip_error_cache` parameter to `endpoint_details` to prevent cached failures from blocking health checks.
- Implemented automatic error recording in `_available_error_cache` on API request failures.
2026-02-13 10:11:41 +01:00
f7ef413090 replays 3af166c8a4 to allow the merge into main 2026-02-12 16:28:40 +01:00
b649dcd8d6 proposal: use global truststore ctx for all connections 2026-02-12 16:15:39 +01:00
5c4e1e81a6
Merge pull request #24 from nomyo-ai:dependabot/pip/pillow-12.1.1
Bump pillow from 11.3.0 to 12.1.1
2026-02-12 15:58:33 +01:00
dependabot[bot]
99f5a3bc91
Bump pillow from 11.3.0 to 12.1.1
Bumps [pillow](https://github.com/python-pillow/Pillow) from 11.3.0 to 12.1.1.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/11.3.0...12.1.1)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.1.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-02-11 17:43:19 +00:00
dd30ab9422 fix SSL: CERTIFICATE_VERIFY_FAILED 2026-02-11 13:47:11 +01:00
3af166c8a4 fix sqlite3.OperationalError: no such table: main.token_time_series 2026-02-11 13:46:37 +01:00
9875eb977a feat: Add tool call normalization and streaming delta accumulation
Adds support for correctly handling tool calls in chat requests. Normalizes tool call data (ensuring IDs, types, and JSON arguments) in non-streaming mode and accumulates OpenAI-style deltas during streaming to build the final Ollama response.
2026-02-10 20:21:46 +01:00
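A sketch of the two pieces with hypothetical helper names: normalizing a tool call so it always has an id, a type, and decoded JSON arguments, and accumulating streamed argument deltas into one final string:

```python
import json
import uuid

def normalize_tool_call(call: dict) -> dict:
    """Fill in a missing id/type and decode the JSON arguments string (illustrative)."""
    fn = call.get("function", {})
    args = fn.get("arguments", "{}")
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            args = {}
    return {
        "id": call.get("id") or f"call_{uuid.uuid4().hex[:8]}",
        "type": call.get("type") or "function",
        "function": {"name": fn.get("name", ""), "arguments": args},
    }

def accumulate_argument_deltas(deltas) -> str:
    """Concatenate streamed OpenAI-style argument fragments into the final string."""
    return "".join(d.get("arguments", "") for d in deltas)
```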
4892998abc feat(router): Add llama-server endpoints support and model parsing
Add `llama_server_endpoints` configuration field to support llama_server OpenAI-compatible endpoints for status checks. Implement helper functions to parse model names and quantization levels from llama-server responses (best effort). Update `is_ext_openai_endpoint` to properly distinguish these endpoints from external OpenAI services. Update sample configuration documentation.
2026-02-10 16:46:51 +01:00
1f81e69ce1 refactor(router.py): correctly implement OpenAI tool_calls to Ollama format conversion 2026-02-09 11:04:14 +01:00
7deb088c6a refactor(cache): split error cache and add stale-while-revalidate
Refactor error tracking to use separate caches for 'available' and 'loaded' models, preventing cross-contamination of transient errors. Implement background refresh for available models to avoid blocking requests, and use stale-while-revalidate: cache entries between 300s and 600s old are served immediately while a refresh runs in the background.
2026-02-08 16:46:40 +01:00
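The stale-while-revalidate policy reduces to a three-way age check; the thresholds mirror the 300s/600s windows from the message, while the function name is illustrative:

```python
FRESH_FOR = 300.0  # younger than this: serve directly
STALE_FOR = 600.0  # between the two: serve stale data, refresh in the background

def cache_state(age_seconds: float) -> str:
    """Classify a cache entry under the stale-while-revalidate policy."""
    if age_seconds < FRESH_FOR:
        return "fresh"
    if age_seconds < STALE_FOR:
        return "stale-while-revalidate"
    return "expired"  # caller must refetch before answering
```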
92cea1dead feat: update reasoning handling
Updated reasoning content handling in router.py to check for both "reasoning_content" and "reasoning" attributes.
2026-02-08 11:29:47 +01:00
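The dual attribute check is essentially a one-liner; a sketch assuming delta-like objects with optional attributes:

```python
from types import SimpleNamespace

def extract_reasoning(delta):
    """Prefer 'reasoning_content' but fall back to 'reasoning'; both appear in the wild."""
    return getattr(delta, "reasoning_content", None) or getattr(delta, "reasoning", None)

delta = SimpleNamespace(reasoning="thinking out loud")
```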
bd0d210b2a feat: enforce api key authentication and update table header
- Added proper API key validation in router.py with 401 response when key is missing
- Implemented CORS headers for authentication requests
- Updated table header from "Until" to "Unload" in static/index.html
- Improved security by preventing API key leakage in access logs
2026-02-01 10:05:46 +01:00
b718d575b7
Merge pull request #22 from nomyo-ai/dev-v0.5.x
Dev v0.5.x
2026-01-30 18:18:42 +01:00
d80b29e4f2 feat: enhance code quality and documentation
- Renamed Feedback class to follow PascalCase convention
- Fixed candidate enumeration start index from 0 to 1
- Simplified candidate content access by removing .message.content
- Updated CONFIG_PATH environment variable name to CONFIG_PATH_ARG
- Bumped version from 0.5 to 0.6
- Removed unnecessary return statement and trailing newline
2026-01-29 19:59:08 +01:00
0ebfa7c519 security: bump orjson to >=3.11.5 to prevent a recursive DoS attack 2026-01-29 18:12:05 +01:00
4ca1a5667e feat(router): implement in-flight request tracking to prevent cache stampede in high concurrency scenarios
Added in-flight request tracking mechanism to prevent cache stampede when multiple concurrent requests arrive for the same endpoint. This introduces new dictionaries to track ongoing requests and a lock to coordinate access. The available_models method was refactored to use an internal helper function and includes request coalescing logic to ensure only one HTTP request is made per endpoint when cache entries expire. The loaded_models method was also updated to use the new caching and coalescing pattern.
2026-01-29 18:00:33 +01:00
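Request coalescing can be sketched like this (names hypothetical): concurrent callers for the same endpoint share one future, so only one HTTP request goes out when a cache entry expires:

```python
import asyncio

_inflight: dict = {}
_inflight_lock = asyncio.Lock()

async def coalesced_fetch(endpoint: str, fetch):
    """Make at most one real fetch per endpoint; concurrent callers await the same future."""
    async with _inflight_lock:
        fut = _inflight.get(endpoint)
        if fut is None:
            fut = asyncio.ensure_future(fetch(endpoint))
            _inflight[endpoint] = fut
            fut.add_done_callback(lambda _: _inflight.pop(endpoint, None))
    return await fut
```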
7c25ffafb2
Merge pull request #21 from YetheSamartaka/model-ps-improvements
Add endpoint differentiation for models ps board
2026-01-29 10:57:22 +01:00
efdf14a207 fix: optimize table column widths and improve time formatting for responsive layout
- Reduced min-width of model columns from 340px to 200px with max-width of 300px
- Added specific styling for narrow columns (3rd-5th) with fixed width and center alignment
- Removed "Instance count" column as it contained redundant information
- Enhanced time formatting logic to show relative time instead of absolute dates
- Simplified digest display to show last 6 characters instead of truncated format
- Added proper handling for various time value types (number, string, null)
2026-01-29 10:54:43 +01:00
bfdae1e4a6 Merge branch 'dev-v0.5.x' of https://github.com/nomyo-ai/nomyo-router into dev-v0.5.x 2026-01-29 10:33:01 +01:00
a1276e3de8 fix: correct indentation for publish_snapshot calls in usage functions
This fix ensures that the snapshot publishing happens within the usage lock context, maintaining proper synchronization of usage counts.
2026-01-29 10:32:59 +01:00
YetheSamartaka
d3aa87ca15 Added endpoint differentiation for models ps board
Added endpoint differentiation to the models PS board so you can see where each model is loaded and for how long, making it easier to view multiple instances of the same model deployed for load balancing
2026-01-27 13:29:54 +01:00
ff402ba0bb
Update video link to clickable thumbnail
Replace static video link with a clickable thumbnail.
2026-01-26 18:34:30 +01:00
bdd4dd45d9
Merge pull request #20 from YetheSamartaka/main
add: Optional router-level API key that gates router/API/web UI access
2026-01-26 18:14:55 +01:00
ee1c460477 Empty key strings could bypass authentication in _extract_router_api_key() when malformed Authorization headers were sent
- Added validation to check that the extracted key is not empty before returning it
- Added CORS headers to enforce_router_api_key() for proper cross-origin request handling and CORS-related error prevention
2026-01-26 18:11:28 +01:00
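The empty-key hardening amounts to treating a blank bearer token as no token at all; a sketch with hypothetical names:

```python
def extract_router_api_key(auth_header):
    """Pull the bearer token from an Authorization header; an empty key counts as missing."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return None
    key = auth_header[len("Bearer "):].strip()
    return key or None  # "Bearer " with nothing after it must not bypass authentication

def enforce_key(auth_header, expected):
    return 200 if extract_router_api_key(auth_header) == expected else 401
```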
d4b2558116 refactor: improve snapshot safety and usage tracking
Create atomic snapshots by deep copying usage data structures to prevent race conditions.
Protect concurrent reads of usage counts with explicit locking in endpoint selection.
Replace README screenshot with a video link.
2026-01-26 17:18:57 +01:00
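Atomic snapshots via deep copy under the lock, sketched below; the shape of `usage_counts` is an assumption for illustration:

```python
import copy
import threading

_usage_lock = threading.Lock()
usage_counts = {"http://a:11434": {"llama3:latest": 2}}

def snapshot_usage():
    """Deep-copy the usage data while holding the lock so readers never see partial updates."""
    with _usage_lock:
        return copy.deepcopy(usage_counts)
```

Mutating the returned snapshot leaves the live counters untouched.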
3e3f0dd383 fix: endpoint selection logic 2026-01-19 14:21:08 +01:00
5ad5bfe66e feat: endpoint selection more consistent and understandable 2026-01-18 09:31:53 +01:00
067cdf641a feat: add timestamp index and improve cache concurrency
- Added index on token_time_series timestamp for faster queries
- Introduced cache locks to prevent race conditions
2026-01-16 16:47:24 +01:00
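The timestamp index can be illustrated with sqlite3; the table schema here is a guess for demonstration, only the table and column names come from the commits:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token_time_series (timestamp REAL, model TEXT, tokens INTEGER)")
# Index the timestamp column so time-range dashboard queries avoid a full table scan
conn.execute("CREATE INDEX IF NOT EXISTS idx_tts_timestamp ON token_time_series (timestamp)")
```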
YetheSamartaka
eca4a92a33 add: Optional router-level API key that gates router/API/web UI access
Optional router-level API key that gates router/API/web UI access (leave empty to disable)

## Supplying the router API key

If you set `nomyo-router-api-key` in `config.yaml` (or `NOMYO_ROUTER_API_KEY` env), every request to NOMYO Router must include the key:

- HTTP header (recommended): `Authorization: Bearer <router_key>`
- Query param (fallback): `?api_key=<router_key>`

Examples:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"
```
2026-01-14 09:28:02 +01:00
6828411f95
Merge pull request #19 from nomyo-ai/dev-v0.5.x
feat:
- buffer_locks preventing race conditions in high-concurrency scenarios
- documentation folder
2026-01-06 10:51:29 +01:00
ac2a4fe8e0
Merge pull request #18 from nomyo-ai/dependabot/pip/aiohttp-3.13.3
Bump aiohttp from 3.12.15 to 3.13.3
2026-01-06 10:49:32 +01:00
dependabot[bot]
66cabcf3a9
Bump aiohttp from 3.12.15 to 3.13.3
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.13.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-01-06 00:33:28 +00:00
20a016269d feat:
- added buffer_lock to prevent race conditions in high-concurrency scenarios
- added documentation
2026-01-05 17:16:31 +01:00