# Usage Guide
## Quick Start
### 1. Install the Router
```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
### 2. Configure Endpoints
Edit `config.yaml`:
```yaml
endpoints:
  - http://localhost:11434
max_concurrent_connections: 2
# Optional router-level API key (leave blank to disable)
nomyo-router-api-key: ""
```
### 3. Run the Router
```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```
### 4. Use the Router
Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.
## API Endpoints
### Ollama-Compatible Endpoints
The router provides all standard Ollama API endpoints:
| Endpoint | Method | Description |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
| `/api/tags` | GET | List available models |
| `/api/ps` | GET | List loaded models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Pull a model |
| `/api/push` | POST | Push a model |
| `/api/create` | POST | Create a model |
| `/api/copy` | POST | Copy a model |
| `/api/delete` | DELETE | Delete a model |
### OpenAI-Compatible Endpoints
For OpenAI API compatibility:
| Endpoint | Method | Description |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
| `/v1/models` | GET | List models |
### Monitoring Endpoints
| Endpoint | Method | Description |
| ---------------------------------- | ------ | ---------------------------------------- |
| `/api/usage` | GET | Current connection counts |
| `/api/token_counts` | GET | Token usage statistics |
| `/api/stats` | POST | Detailed model statistics |
| `/api/aggregate_time_series_days` | POST | Aggregate time-series data into daily buckets |
| `/api/version` | GET | Ollama version info |
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
| `/api/cache/stats` | GET | Cache hit/miss counters and config |
| `/api/cache/invalidate` | POST | Clear all cache entries and counters |
## Making Requests
### Basic Chat Request
```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```
### Streaming Response
```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```
### OpenAI API Format
```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
## Advanced Features
### Multiple Opinions Ensemble (MOE)
Prefix your model name with `moe-` to enable the MOE system:
```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```
The MOE system:
1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response
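Client-side, nothing changes besides the `moe-` model prefix. The selection step in the pipeline above can be sketched roughly as follows; this is a simplified illustration of the idea, not the router's actual implementation, and the function and score names are hypothetical:

```python
def select_best(responses, critique_scores):
    """Pick the candidate response whose critiques scored highest.

    Hypothetical sketch of the MOE selection step: `responses` is a list
    of candidate answers, `critique_scores` a parallel list of numeric
    critique scores (higher is better).
    """
    best_index = max(range(len(responses)), key=lambda i: critique_scores[i])
    return responses[best_index]

# Three candidate answers with their critique scores
candidates = ["answer A", "answer B", "answer C"]
scores = [0.62, 0.91, 0.75]
print(select_best(candidates, scores))  # → answer B
```

The router then feeds the selected answer back into a final refinement pass before returning it to the client.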
### Semantic LLM Cache
The router can cache LLM responses and serve them instantly — bypassing endpoint selection, model loading, and token generation entirely. Cached responses work for both streaming and non-streaming clients.
Enable it in `config.yaml`:
```yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 0.9 # semantic matching (requires :semantic image)
cache_ttl: 3600
```
For exact-match only (no extra dependencies):
```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 1.0
```
Check cache performance:
```bash
curl http://localhost:12434/api/cache/stats
```
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
Clear the cache:
```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
**Notes:**
- MOE requests (`moe-*` model prefix) always bypass the cache
- Cache is isolated per `model + system prompt` — different users with different system prompts cannot receive each other's cached responses
- Semantic matching requires the `:semantic` Docker image tag (`ghcr.io/nomyo-ai/nomyo-router:latest-semantic`)
- See [configuration.md](configuration.md#semantic-llm-cache) for all cache options
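The `hit_rate` field in the stats payload is simply `hits / (hits + misses)`. A small helper to compute it client-side might look like this (illustrative, assuming the response shape shown above):

```python
def cache_hit_rate(stats: dict) -> float:
    """Compute hit rate from an /api/cache/stats payload."""
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0

# Using the counters from the sample response above
sample = {"hits": 1547, "misses": 892}
print(round(cache_hit_rate(sample), 3))  # → 0.634
```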
### Token Tracking
The router automatically tracks token usage:
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```
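A quick way to aggregate the per-endpoint breakdown client-side, assuming the response shape shown above:

```python
def totals_by_model(payload: dict) -> dict:
    """Sum input/output tokens per model across all endpoints."""
    totals: dict = {}
    for row in payload["breakdown"]:
        entry = totals.setdefault(row["model"], {"input": 0, "output": 0})
        entry["input"] += row["input_tokens"]
        entry["output"] += row["output_tokens"]
    return totals

payload = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542}
    ],
}
print(totals_by_model(payload))  # → {'llama3': {'input': 120, 'output': 1422}}
```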
### Real-time Monitoring
Use Server-Sent Events to monitor usage in real-time:
```bash
curl http://localhost:12434/api/usage-stream
```
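SSE streams deliver events as `data: <json>` lines. A minimal line parser is sketched below; the `active_connections` field name is only an example payload, not a documented schema:

```python
import json

def parse_sse_data(line: str):
    """Return the decoded JSON payload of an SSE `data:` line, else None."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None

# In practice, iterate over the streaming response body line by line,
# e.g. with requests:
#   for line in resp.iter_lines(decode_unicode=True):
#       usage = parse_sse_data(line)
print(parse_sse_data('data: {"active_connections": 1}'))  # → {'active_connections': 1}
```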
## Integration Examples
### Python Client
```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False,
}

response = requests.post(url, json=data)
print(response.json())
```
### JavaScript Client
```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```
### Streaming with JavaScript
```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```
### Python Ollama Client
```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```
### Python OpenAI Client
```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```
## Best Practices
### 1. Model Selection
- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without version
### 2. Connection Management
- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries
### 3. Error Handling
- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
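Retry with exponential backoff can be as small as the generic sketch below; it is not router-specific, and the parameter defaults are arbitrary:

```python
import time

def with_retries(fn, attempts=4, base_delay=0.5):
    """Call fn(), retrying on any exception with exponential backoff.

    Sleeps base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    and re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Wrap your request call in it, e.g. `with_retries(lambda: requests.post(url, json=data, timeout=30))`.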
### 4. Performance
- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs
## Troubleshooting
### Common Issues
**Problem**: Model not found
- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused
- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency
- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working
- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`
## Examples
See the [examples](examples/) directory for complete integration examples.
### Authentication to NOMYO Router
If a router API key is configured, include it with each request:

- Header: `Authorization: Bearer <router_key>`
- Query parameter: `?api_key=<router_key>`
Example (tags):
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
```
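From Python, the same key can be attached either way; the small helpers below are shown for illustration only:

```python
def auth_headers(router_key: str) -> dict:
    """Build the Authorization header the router expects."""
    return {"Authorization": f"Bearer {router_key}"}

def auth_params(router_key: str) -> dict:
    """Or pass the key as a query parameter instead."""
    return {"api_key": router_key}

# e.g. requests.get("http://localhost:12434/api/tags",
#                   headers=auth_headers(key))
print(auth_headers("secret"))  # → {'Authorization': 'Bearer secret'}
```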