# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434

max_concurrent_connections: 2

# Optional router-level API key (leave blank to disable)
nomyo-router-api-key: ""
```
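
Since the router balances load across backends, additional endpoints are simply appended to the list. A sketch with hypothetical hosts (the extra URLs are placeholders; tune the connection limit to your hardware):

```yaml
endpoints:
  - http://localhost:11434
  - http://gpu-box-1:11434   # hypothetical second Ollama host
  - http://gpu-box-2:11434   # hypothetical third Ollama host

max_concurrent_connections: 2

nomyo-router-api-key: ""
```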

### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.

## API Endpoints

### Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint        | Method | Description           |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST   | Generate text         |
| `/api/chat`     | POST   | Chat completions      |
| `/api/embed`    | POST   | Embeddings            |
| `/api/tags`     | GET    | List available models |
| `/api/ps`       | GET    | List loaded models    |
| `/api/show`     | POST   | Show model details    |
| `/api/pull`     | POST   | Pull a model          |
| `/api/push`     | POST   | Push a model          |
| `/api/create`   | POST   | Create a model        |
| `/api/copy`     | POST   | Copy a model          |
| `/api/delete`   | DELETE | Delete a model        |

### OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint               | Method | Description      |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST   | Chat completions |
| `/v1/completions`      | POST   | Text completions |
| `/v1/embeddings`       | POST   | Embeddings       |
| `/v1/models`           | GET    | List models      |

### Monitoring Endpoints

| Endpoint                          | Method | Description                              |
| --------------------------------- | ------ | ---------------------------------------- |
| `/api/usage`                      | GET    | Current connection counts                |
| `/api/token_counts`               | GET    | Token usage statistics                   |
| `/api/stats`                      | POST   | Detailed model statistics                |
| `/api/aggregate_time_series_days` | POST   | Aggregate time-series data into daily buckets |
| `/api/version`                    | GET    | Ollama version info                      |
| `/api/config`                     | GET    | Endpoint configuration                   |
| `/api/usage-stream`               | GET    | Real-time usage updates (SSE)            |
| `/health`                         | GET    | Health check                             |

## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response
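
The four steps can be pictured with this toy sketch. It is not the router's implementation (the real pipeline runs server-side across endpoints); `generate`, `critique`, and `score` are illustrative stand-ins:

```python
# Toy sketch of the draft -> critique -> select -> refine flow above.
# generate/critique/score are caller-supplied stand-in functions.
def moe_flow(prompt, generate, critique, score, n=3):
    drafts = [generate(prompt, i) for i in range(n)]                 # 1. n drafts
    reviews = [critique(prompt, d) for d in drafts]                  # 2. n critiques
    best = max(zip(drafts, reviews), key=lambda p: score(p[1]))[0]   # 3. pick best draft
    return generate(f"Refine: {best}", 0)                            # 4. refined answer
```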

### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```
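
When several endpoints serve traffic, the `breakdown` list can be rolled up per endpoint. A small sketch over the response shape shown above:

```python
# Roll up the /api/token_counts breakdown per endpoint.
# `counts` mirrors the example response shown above.
counts = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542},
    ],
}

def tokens_per_endpoint(counts):
    """Sum total_tokens for each endpoint across all models."""
    totals = {}
    for row in counts["breakdown"]:
        totals[row["endpoint"]] = totals.get(row["endpoint"], 0) + row["total_tokens"]
    return totals

print(tokens_per_endpoint(counts))  # {'http://localhost:11434': 1542}
```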

### Real-time Monitoring

Use Server-Sent Events to monitor usage in real time:

```bash
curl http://localhost:12434/api/usage-stream
```
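
The stream arrives as standard SSE `data:` lines. In Python the JSON payloads can be picked out with a small helper (the exact fields inside each event depend on the router's payload, so treat the shape as illustrative):

```python
import json

def parse_sse(chunk: str):
    """Extract JSON payloads from the `data:` lines of an SSE chunk."""
    events = []
    for line in chunk.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events
```

Feed it successive chunks read from the open HTTP response to get a list of decoded events per chunk.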

## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```

### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```

### Server-Sent Events with JavaScript

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```

### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```

### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```

## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
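
The retry advice above can be sketched as follows; the attempt count and delays are arbitrary choices for illustration, not router requirements:

```python
import time

def with_retries(call, attempts=4, base_delay=0.5):
    """Retry `call` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Wrap any request function, e.g. `with_retries(lambda: requests.post(url, json=data))`.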

### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs

## Troubleshooting

### Common Issues

**Problem**: Model not found

- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused

- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency

- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working

- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`

## Examples

See the [examples](examples/) directory for complete integration examples.

## Authentication to NOMYO Router

If a router API key is configured, include it with each request in one of two ways:

- Header: `Authorization: Bearer <router_key>`
- Query parameter: `?api_key=<router_key>`

Example (tags):

```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
```
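
From Python, the same key can be attached either way. A sketch using only the standard library; reading the key from `NOMYO_ROUTER_API_KEY` mirrors the shell example above:

```python
import os
import urllib.request

ROUTER = "http://localhost:12434"
key = os.environ.get("NOMYO_ROUTER_API_KEY", "")

# Option 1: bearer header (keeps the key out of URLs and server logs)
headers = {"Authorization": f"Bearer {key}"} if key else {}
req = urllib.request.Request(f"{ROUTER}/api/tags", headers=headers)

# Option 2: query parameter
url = f"{ROUTER}/api/tags" + (f"?api_key={key}" if key else "")
```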