Compare commits

...

8 commits
v0.7.4 ... main

Author SHA1 Message Date
a3928c9c33 Merge pull request 'dev-v0.7.x -> main' (#25) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 34s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 38s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m5s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 31s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m11s
Build and Publish Docker Image / merge (push) Successful in 31s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/25
2026-04-16 12:27:34 +02:00
1a2781ac23
fix: health check all endpoints with the right per-endpoint path
issue: resolving #24
2026-04-16 12:18:38 +02:00
a3e7e8a007
sec: bump pillow version to mitigate vuln 2026-04-14 09:31:05 +02:00
5ac412eb5c
doc: feature updates 2026-04-14 09:17:33 +02:00
537b757c4a
fix: align pip cmds 2026-04-13 14:13:35 +02:00
f4b3a09151 Merge pull request 'dev-v0.7.x -> main' (#22) from dev-v0.7.x into main
All checks were successful
Build and Publish Docker Image (Semantic Cache) / build (amd64, linux/amd64, docker-amd64) (push) Successful in 37s
Build and Publish Docker Image / build (amd64, linux/amd64, docker-amd64) (push) Successful in 36s
Build and Publish Docker Image (Semantic Cache) / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m10s
Build and Publish Docker Image (Semantic Cache) / merge (push) Successful in 33s
Build and Publish Docker Image / build (arm64, linux/arm64, docker-arm64) (push) Successful in 10m2s
Build and Publish Docker Image / merge (push) Successful in 33s
Reviewed-on: https://bitfreedom.net/code/code/nomyo-ai/nomyo-router/pulls/22
2026-04-13 14:00:20 +02:00
1058f2418b
fix: security, exempt files to prevent path traversal 2026-04-10 17:40:44 +02:00
263c66aedd
feat: add hostname to dashboard 2026-04-10 17:29:43 +02:00
5 changed files with 74 additions and 12 deletions

View file

@@ -26,8 +26,8 @@ RUN pip install --root-user-action=ignore --no-cache-dir --upgrade pip \
 # CPU-only torch must be installed before sentence-transformers to avoid
 # pulling the full CUDA-enabled build (~2.5 GB).
 RUN if [ "$SEMANTIC_CACHE" = "true" ]; then \
-    pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
-    pip install --no-cache-dir sentence-transformers && \
+    pip install --root-user-action=ignore --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
+    pip install --root-user-action=ignore --no-cache-dir sentence-transformers && \
     python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"; \
 fi

View file

@@ -127,6 +127,34 @@ The router can proxy requests to OpenAI-compatible endpoints alongside Ollama en
 - Handles authentication with API keys
 - Maintains consistent behavior across endpoint types
+
+### Reactive Context-Shift
+
+When a backend returns an `exceed_context_size_error` (context window exceeded), the router automatically trims the conversation history and retries rather than surfacing the error to the client.
+
+**How it works:**
+
+1. The error body contains `n_ctx` (the model's context limit) and `n_prompt_tokens` (the actual token count as measured by the backend).
+2. `_calibrated_trim_target()` computes a tiktoken-scale trim target using the *delta* between the actual token count and the context limit, correcting for the fact that tiktoken counts fewer tokens than the backend tokeniser does.
+3. `_trim_messages_for_context()` implements a sliding-window drop: system messages are always preserved; the oldest non-system messages are evicted first (FIFO) until the estimated token count fits the target. The most recent message is never dropped. After trimming, leading assistant/tool messages are removed to satisfy chat-template requirements (the first non-system message must be a user message).
+4. Two retry attempts are made:
+   - **Retry 1** — trimmed messages, original tool definitions.
+   - **Retry 2** — trimmed messages with tool definitions also stripped (handles cases where tool schemas alone consume too many tokens).
+
+**Proactive pre-trimming:**
+
+Once a context overflow has been observed for an endpoint/model pair whose `n_ctx` ≤ 32 768, the router records that limit in `_endpoint_nctx`. Subsequent requests to the same pair are pre-trimmed before being sent, avoiding the round-trip to the backend entirely for small-context models.
+
+### Reactive SSE Push
+
+The `/api/usage-stream` endpoint delivers real-time usage updates using a pub/sub push model rather than client polling.
+
+**Mechanism:**
+
+- `subscribe()` creates a bounded `asyncio.Queue` (capacity 10) and registers it in `_subscribers`.
+- Whenever `usage_counts` or `token_usage_counts` change — on every `increment_usage`, `decrement_usage`, or token-worker flush — `_capture_snapshot()` serialises the current state to JSON while the caller still holds the relevant lock, then `_distribute_snapshot()` pushes the snapshot to every registered queue outside the lock.
+- If a subscriber's queue is full (slow client), the oldest undelivered snapshot is evicted before the new one is enqueued, so fast producers never block on slow consumers.
+- `unsubscribe()` removes the queue when the SSE connection closes; `close_all_sse_queues()` sends a `None` sentinel to all subscribers during router shutdown.

 ## Performance Considerations

 ### Concurrency Model
@@ -145,7 +173,7 @@ The router can proxy requests to OpenAI-compatible endpoints alongside Ollama en
 ### Memory Management
 - **Write-behind pattern**: Token counts buffered in memory, flushed periodically
-- **Queue-based SSE**: Server-Sent Events use bounded queues to prevent memory bloat
+- **Queue-based SSE**: Bounded per-subscriber queues (capacity 10) with oldest-eviction — see [Reactive SSE Push](#reactive-sse-push)
 - **Automatic cleanup**: Zero connection counts are removed from tracking

 ## Error Handling
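The sliding-window trim described in the Reactive Context-Shift notes above can be sketched as a standalone function. This is illustrative only, not the router's actual code: the function name mirrors `_trim_messages_for_context()`, but the rough 4-characters-per-token estimate stands in for the real tiktoken count.

```python
def estimate_tokens(msg: dict) -> int:
    # Crude stand-in for a tiktoken count: roughly 4 characters per token.
    return max(1, len(str(msg.get("content", ""))) // 4)

def trim_messages_for_context(messages: list[dict], target_tokens: int) -> list[dict]:
    # System messages are always preserved.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Evict the oldest non-system messages first (FIFO), but never the
    # most recent message, until the estimate fits the target.
    while len(rest) > 1 and sum(map(estimate_tokens, system + rest)) > target_tokens:
        rest.pop(0)
    # Chat templates expect the first non-system message to come from the
    # user, so drop leading assistant/tool messages left over by the trim.
    while len(rest) > 1 and rest[0]["role"] in ("assistant", "tool"):
        rest.pop(0)
    return system + rest
```

On a history of [system, user, assistant, user] that overflows the target, the two oldest non-system messages are evicted and the system message plus the latest user message survive.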

View file

@@ -22,7 +22,7 @@ ollama==0.6.1
 openai==1.102.0
 orjson>=3.11.5
 numpy>=1.26
-pillow==12.1.1
+pillow==12.2.0
 propcache==0.3.2
 pydantic==2.11.7
 pydantic-settings==2.10.1

View file

@@ -6,7 +6,7 @@ version: 0.7
 license: AGPL
 """
 # -------------------------------------------------------------
-import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance, secrets, math
+import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance, secrets, math, socket
 try:
     import truststore; truststore.inject_into_ssl()
 except ImportError:
@@ -373,7 +373,11 @@ async def enforce_router_api_key(request: Request, call_next):
         return await call_next(request)
     path = request.url.path
-    if path.startswith("/static") or path in {"/", "/favicon.ico"}:
+    # Allow static assets (CSS, JS, images, fonts) but NOT HTML pages,
+    # which would bypass auth by accessing /static/index.html directly.
+    _STATIC_ASSET_EXTS = {".css", ".js", ".ico", ".png", ".jpg", ".jpeg", ".svg", ".woff", ".woff2", ".ttf", ".map"}
+    is_static_asset = path.startswith("/static") and Path(path).suffix.lower() in _STATIC_ASSET_EXTS
+    if is_static_asset or path in {"/", "/favicon.ico"}:
         return await call_next(request)

     provided_key = _extract_router_api_key(request)
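The suffix-based check added in this hunk can be exercised in isolation. A minimal sketch, with the extension set copied from the diff and an illustrative helper name:

```python
from pathlib import Path

# Mirrors the set added to the auth middleware; only real asset types
# are whitelisted, so HTML under /static still goes through auth.
_STATIC_ASSET_EXTS = {".css", ".js", ".ico", ".png", ".jpg", ".jpeg",
                      ".svg", ".woff", ".woff2", ".ttf", ".map"}

def is_static_asset(path: str) -> bool:
    # Path.suffix yields the final extension (e.g. ".js"), empty for none.
    return path.startswith("/static") and Path(path).suffix.lower() in _STATIC_ASSET_EXTS
```

With this, `/static/css/app.css` passes without a key while `/static/index.html` and API routes do not.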
@@ -3750,21 +3754,37 @@ async def health_proxy(request: Request):
     - `endpoints`: a mapping of endpoint URL `{status, version|detail}`.
     * The HTTP status code is 200 when everything is healthy, 503 otherwise.
     """
-    # Run all health checks in parallel
-    tasks = [fetch.endpoint_details(ep, "/api/version", "version", skip_error_cache=True) for ep in config.endpoints]  # if not is_ext_openai_endpoint(ep)]
+    # Run all health checks in parallel.
+    # Ollama endpoints expose /api/version; OpenAI-compatible endpoints (vLLM,
+    # llama-server, external) expose /models. Using /api/version against an
+    # OpenAI-compatible endpoint yields a 404 and noisy log output.
+    all_endpoints = list(config.endpoints)
+    llama_eps_extra = [ep for ep in config.llama_server_endpoints if ep not in config.endpoints]
+    all_endpoints += llama_eps_extra
+    tasks = []
+    for ep in all_endpoints:
+        if is_openai_compatible(ep):
+            tasks.append(fetch.endpoint_details(ep, "/models", "data", config.api_keys.get(ep), skip_error_cache=True))
+        else:
+            tasks.append(fetch.endpoint_details(ep, "/api/version", "version", skip_error_cache=True))
     results = await asyncio.gather(*tasks, return_exceptions=True)

     health_summary = {}
     overall_ok = True
-    for ep, result in zip(config.endpoints, results):
+    for ep, result in zip(all_endpoints, results):
         if isinstance(result, Exception):
             # Endpoint did not respond / returned an error
             health_summary[ep] = {"status": "error", "detail": str(result)}
             overall_ok = False
         else:
-            # Successful response: report the reported version
-            health_summary[ep] = {"status": "ok", "version": result}
+            # Successful response: report the version (Ollama) or
+            # indicate the endpoint is reachable (OpenAI-compatible).
+            if is_openai_compatible(ep):
+                health_summary[ep] = {"status": "ok"}
+            else:
+                health_summary[ep] = {"status": "ok", "version": result}

     response_payload = {
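The hunk above relies on a standard asyncio pattern: run the probes concurrently with `return_exceptions=True`, then map each result or exception to a summary entry. A toy reduction of it, with placeholder probe bodies and endpoint names:

```python
import asyncio

async def probe(ep: str) -> str:
    # Placeholder for fetch.endpoint_details(); pretend endpoints whose
    # URL ends in "down" are unreachable.
    if ep.endswith("down"):
        raise ConnectionError("unreachable")
    return "0.7"

async def health(endpoints: list[str]) -> tuple[dict, bool]:
    # return_exceptions=True turns failures into values instead of
    # aborting the whole gather.
    results = await asyncio.gather(*(probe(ep) for ep in endpoints),
                                   return_exceptions=True)
    summary, ok = {}, True
    for ep, res in zip(endpoints, results):
        if isinstance(res, Exception):
            summary[ep] = {"status": "error", "detail": str(res)}
            ok = False
        else:
            summary[ep] = {"status": "ok", "version": res}
    return summary, ok
```

The boolean mirrors `overall_ok` in the route and would drive the 200-vs-503 choice.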
@@ -3776,7 +3796,15 @@ async def health_proxy(request: Request):
     return JSONResponse(content=response_payload, status_code=http_status)

 # -------------------------------------------------------------
-# 27. SSE route for usage broadcasts
+# 27. Hostname endpoint
+# -------------------------------------------------------------
+@app.get("/api/hostname")
+async def get_hostname():
+    """Return the hostname of the machine running the router."""
+    return JSONResponse(content={"hostname": socket.gethostname()})
+
+# -------------------------------------------------------------
+# 28. SSE route for usage broadcasts
 # -------------------------------------------------------------
 @app.get("/api/usage-stream")
 async def usage_stream(request: Request):
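The bounded-queue push model behind `/api/usage-stream` (described under Reactive SSE Push in the architecture notes above) roughly reduces to the following sketch; the class and method names here are illustrative, not the router's actual API:

```python
import asyncio
import json

class UsageBroadcaster:
    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q = asyncio.Queue(maxsize=self.capacity)
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def distribute_snapshot(self, state: dict) -> None:
        payload = json.dumps(state)
        for q in self._subscribers:
            if q.full():
                # Slow client: evict the oldest undelivered snapshot so
                # fast producers never block on slow consumers.
                q.get_nowait()
            q.put_nowait(payload)

    def close_all(self) -> None:
        # A None sentinel tells each SSE handler to end its stream.
        for q in self._subscribers:
            if q.full():
                q.get_nowait()
            q.put_nowait(None)
```

Pushing three snapshots into a capacity-2 queue leaves only the two newest, which is the oldest-eviction behaviour the Memory Management bullet refers to.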

View file

@@ -344,6 +344,7 @@
         </div>
         <div class="header-row">
           <h1>Router Dashboard</h1>
+          <span id="hostname" style="color:#777; font-size:0.85em;"></span>
           <button id="total-tokens-btn">Stats Total</button>
           <span id="aggregation-status" class="loading" style="margin-left:8px;"></span>
         </div>
@@ -1418,6 +1419,11 @@ function initStatsChart(timeSeriesData, endpointDistribution) {
 </script>
 <script>
 document.addEventListener('DOMContentLoaded', () => {
+  authedFetch('/api/hostname').then(r => r.json()).then(data => {
+    const el = document.getElementById('hostname');
+    if (el && data.hostname) el.textContent = data.hostname;
+  }).catch(() => {});
+
   const totalBtn = document.getElementById('total-tokens-btn');
   if (totalBtn) {
     totalBtn.addEventListener('click', async () => {