Merge pull request 'dev-v0.7.x to prod' (#1) from dev-v0.7.x into main
Some checks failed
Build and Publish Docker Image (Semantic Cache) / build-and-push-semantic (push) Has been cancelled
Build and Publish Docker Image / build-and-push (push) Has been cancelled

Reviewed-on: https://bitfreedom.net/code/nomyo-ai/nomyo-router/pulls/1
This commit is contained in:
Alpha Nerd 2026-04-02 09:17:59 +02:00
commit ba1b2fb651
5 changed files with 44 additions and 30 deletions


@@ -24,7 +24,7 @@ doc/
1. **Install the router**:
```bash
-git clone https://github.com/nomyo-ai/nomyo-router.git
+git clone https://bitfreedom.net/code/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
@@ -36,14 +36,19 @@ doc/
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
-# Optional router-level API key (leave blank to disable)
-nomyo-router-api-key: ""
-```
+# Optional router-level API key (leave blank to disable)
+nomyo-router-api-key: ""
+```
3. **Run the router**:
-```bash
-uvicorn router:app --host 0.0.0.0 --port 12434
-```
+```bash
+uvicorn router:app --host 0.0.0.0 --port 12434
+```
4. **Use the router**: Point your frontend to `http://localhost:12434` instead of your Ollama instance.
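Step 4 boils down to swapping the Ollama base URL for the router's. A minimal illustrative helper (the `to_router` name is hypothetical, not part of the project):

```python
OLLAMA_URL = "http://localhost:11434"
ROUTER_URL = "http://localhost:12434"

def to_router(url: str) -> str:
    """Rewrite a direct-Ollama URL so the request goes through the router."""
    return url.replace(OLLAMA_URL, ROUTER_URL, 1)
```

Any client that lets you set a base URL needs only this one change; the router speaks the same API.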
### Key Features
@@ -138,14 +143,15 @@ For additional help:
Happy routing! 🚀
## Router API key usage
If the router API key is set (`NOMYO_ROUTER_API_KEY` env or `nomyo-router-api-key` in config), include it in every request:
- Header (preferred): `Authorization: Bearer <router_key>`
- Query param: `?api_key=<router_key>`
Example:
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
```
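Both authentication forms can be built programmatically. A stdlib-only Python sketch (the fallback key value is illustrative):

```python
import os
from urllib.parse import urlencode

ROUTER_URL = "http://localhost:12434"
key = os.environ.get("NOMYO_ROUTER_API_KEY", "example-key")

# Preferred: Authorization header
headers = {"Authorization": f"Bearer {key}"}

# Fallback: api_key query parameter
tags_url = f"{ROUTER_URL}/api/tags?" + urlencode({"api_key": key})
```

Pass `headers` with any HTTP client, or use `tags_url` when headers are awkward to set.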


@@ -16,7 +16,7 @@ NOMYO Router can be deployed in various environments depending on your requireme
```bash
# Clone the repository
-git clone https://github.com/nomyo-ai/nomyo-router.git
+git clone https://bitfreedom.net/code/nomyo-ai/nomyo-router.git
cd nomyo-router
# Create virtual environment
@@ -84,10 +84,11 @@ sudo systemctl status nomyo-router
### Image variants
-| Tag | Semantic cache | Image size |
-|---|---|---|
-| `latest` | ❌ exact match only | ~300 MB |
-| `latest-semantic` | ✅ sentence-transformers + `all-MiniLM-L6-v2` pre-baked | ~800 MB |
+| Tag | Semantic cache | Image size |
+| ------------------- | -------------------------------------------------------- | ------------ |
+| `latest` | ❌ exact match only | ~300 MB |
+| `latest-semantic` | ✅ sentence-transformers + `all-MiniLM-L6-v2` pre-baked | ~800 MB |
The `:semantic` variant enables `cache_similarity < 1.0` in `config.yaml`. The lean image falls back to exact-match caching with a warning if semantic mode is configured.
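A `config.yaml` sketch for semantic mode, building on the quick-start config above (the `0.9` threshold and key placement are illustrative, not project defaults):

```yaml
endpoints:
  - http://localhost:11434
max_concurrent_connections: 2
# Values below 1.0 enable semantic matching; requires the latest-semantic image
cache_similarity: 0.9
```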


@@ -5,7 +5,7 @@
### 1. Install the Router
```bash
-git clone https://github.com/nomyo-ai/nomyo-router.git
+git clone https://bitfreedom.net/code/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
@@ -42,8 +42,9 @@ Configure your frontend to point to `http://localhost:12434` instead of your Oll
The router provides all standard Ollama API endpoints:
| Endpoint | Method | Description |
-| --------------- | ------ | --------------------- |
+| ----------------- | -------- | ----------------------- |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
@@ -60,8 +61,9 @@ The router provides all standard Ollama API endpoints:
For OpenAI API compatibility:
| Endpoint | Method | Description |
-| ---------------------- | ------ | ---------------- |
+| ------------------------ | -------- | ------------------ |
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
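Requests to the OpenAI-compatible routes use the standard chat-completions payload shape. A stdlib-only sketch that only builds the request (the model name is illustrative):

```python
import json

ROUTER_URL = "http://localhost:12434"

payload = {
    "model": "llama3.2",  # illustrative model name
    "messages": [{"role": "user", "content": "Hello from the router"}],
}
url = f"{ROUTER_URL}/v1/chat/completions"
body = json.dumps(payload).encode("utf-8")
```

Send `body` as a POST with `Content-Type: application/json`, adding the Bearer header if a router key is configured.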
@@ -69,18 +71,19 @@ For OpenAI API compatibility:
### Monitoring Endpoints
-| Endpoint | Method | Description |
-| ---------------------------------- | ------ | ---------------------------------------- |
-| `/api/usage` | GET | Current connection counts |
-| `/api/token_counts` | GET | Token usage statistics |
-| `/api/stats` | POST | Detailed model statistics |
-| `/api/aggregate_time_series_days` | POST | Aggregate time series data into daily |
-| `/api/version` | GET | Ollama version info |
-| `/api/config` | GET | Endpoint configuration |
-| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
-| `/health` | GET | Health check |
-| `/api/cache/stats` | GET | Cache hit/miss counters and config |
-| `/api/cache/invalidate` | POST | Clear all cache entries and counters |
+| Endpoint | Method | Description |
+| ----------------------------------- | -------- | --------------------------------------- |
+| `/api/usage` | GET | Current connection counts |
+| `/api/token_counts` | GET | Token usage statistics |
+| `/api/stats` | POST | Detailed model statistics |
+| `/api/aggregate_time_series_days` | POST | Aggregate time series data into daily |
+| `/api/version` | GET | Ollama version info |
+| `/api/config` | GET | Endpoint configuration |
+| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
+| `/health` | GET | Health check |
+| `/api/cache/stats` | GET | Cache hit/miss counters and config |
+| `/api/cache/invalidate` | POST | Clear all cache entries and counters |
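`/api/usage-stream` pushes server-sent events, where each `data:` line carries a JSON payload. A minimal parser sketch; the sample payload shape is illustrative, keyed by endpoint then model as the dashboard's `token_usage_counts` structure suggests:

```python
import json

def parse_sse(stream_text: str):
    """Extract JSON payloads from the `data:` lines of an SSE stream."""
    events = []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

sample = 'data: {"token_usage_counts": {"http://localhost:11434": {"llama3.2": 128}}}\n\n'
events = parse_sse(sample)
```

In practice you would read the stream incrementally; this sketch assumes the full text is already buffered.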
## Making Requests
@@ -196,6 +199,7 @@ curl -X POST http://localhost:12434/api/cache/invalidate
```
**Notes:**
- MOE requests (`moe-*` model prefix) always bypass the cache
- Cache is isolated per `model + system prompt` — different users with different system prompts cannot receive each other's cached responses
- Semantic matching requires the `:semantic` Docker image tag (`ghcr.io/nomyo-ai/nomyo-router:latest-semantic`)
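The per-`model + system prompt` isolation can be pictured as scoping the cache key. An illustrative sketch, not the router's actual implementation:

```python
import hashlib

def cache_key(model: str, system_prompt: str, prompt: str) -> str:
    """Illustrative: entries are only shared within one (model, system prompt) scope."""
    raw = "\x00".join([model, system_prompt, prompt])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

a = cache_key("llama3.2", "You are a pirate.", "Hi")
b = cache_key("llama3.2", "You are a lawyer.", "Hi")
```

Because the system prompt is part of the scope, two users with different system prompts can never hit each other's cached responses, even for identical user prompts.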
@@ -404,14 +408,15 @@ print(f"Available models: {[m.id for m in response.data]}")
See the [examples](examples/) directory for complete integration examples.
### Authentication to NOMYO Router
If a router API key is configured, include it with each request:
- Header: `Authorization: Bearer <router_key>`
- Query: `?api_key=<router_key>`
Example (tags):
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
```


@@ -42,4 +42,4 @@ yarl==1.20.1
aiosqlite
# Semantic LLM cache — base install (exact-match mode, no heavy ML deps)
# For semantic mode use the :semantic Docker image tag (adds sentence-transformers + torch)
-semantic-llm-cache@git+https://github.com/nomyo-ai/async-semantic-llm-cache.git@v0.1
+semantic-llm-cache@git+https://bitfreedom.net/code/nomyo-ai/async-semantic-llm-cache.git@v0.1.1


@@ -1092,7 +1092,9 @@ function renderTimeSeriesChart(timeSeriesData, chart, minutes) {
function updateTpsChart(payload) {
const tokens = payload.token_usage_counts || {};
const perModelTokens = {};
-psRows.forEach((_, model) => {
+const allModels = new Set();
+for (const ep in tokens) for (const model in tokens[ep]) allModels.add(model);
+allModels.forEach(model => {
let total = 0;
for (const ep in tokens) total += tokens[ep]?.[model] || 0;
// Normalise against the first-seen cumulative total so history