# Usage Guide
## Quick Start
### 1. Install the Router
```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
### 2. Configure Endpoints
Edit `config.yaml`:
```yaml
endpoints:
  - http://localhost:11434
max_concurrent_connections: 2
# Optional router-level API key (leave blank to disable)
nomyo-router-api-key: ""
```
### 3. Run the Router
```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```
### 4. Use the Router
Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.
## API Endpoints
### Ollama-Compatible Endpoints
The router provides all standard Ollama API endpoints:
| Endpoint | Method | Description |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
| `/api/tags` | GET | List available models |
| `/api/ps` | GET | List loaded models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Pull a model |
| `/api/push` | POST | Push a model |
| `/api/create` | POST | Create a model |
| `/api/copy` | POST | Copy a model |
| `/api/delete` | DELETE | Delete a model |
### OpenAI-Compatible Endpoints
For OpenAI API compatibility:
| Endpoint | Method | Description |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
| `/v1/models` | GET | List models |
### Monitoring Endpoints
| Endpoint | Method | Description |
| ---------------------------------- | ------ | ---------------------------------------- |
| `/api/usage` | GET | Current connection counts |
| `/api/token_counts` | GET | Token usage statistics |
| `/api/stats` | POST | Detailed model statistics |
| `/api/aggregate_time_series_days` | POST | Aggregate time-series data into daily buckets |
| `/api/version` | GET | Ollama version info |
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
| `/api/cache/stats` | GET | Cache hit/miss counters and config |
| `/api/cache/invalidate` | POST | Clear all cache entries and counters |
## Making Requests
### Basic Chat Request
```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```
### Streaming Response
```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```
### OpenAI API Format
```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
## Advanced Features
### Multiple Opinions Ensemble (MOE)
Prefix your model name with `moe-` to enable the MOE system:
```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```
The MOE system:
1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response
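Client-side, nothing changes besides the `moe-` model prefix. The selection step in the pipeline above can be sketched roughly as follows; this is a simplified illustration of the idea, not the router's actual implementation, and the function and score names are hypothetical:

```python
def select_best(responses, critique_scores):
    """Pick the candidate response whose critiques scored highest.

    Hypothetical sketch of the MOE selection step: `responses` is a list
    of candidate answers, `critique_scores` a parallel list of numeric
    critique scores (higher is better).
    """
    best_index = max(range(len(responses)), key=lambda i: critique_scores[i])
    return responses[best_index]

# Three candidate answers with their critique scores
candidates = ["answer A", "answer B", "answer C"]
scores = [0.62, 0.91, 0.75]
print(select_best(candidates, scores))  # → answer B
```

The router then feeds the selected answer back into a final refinement pass before returning it to the client.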
### Semantic LLM Cache
The router can cache LLM responses and serve them instantly — bypassing endpoint selection, model loading, and token generation entirely. Cached responses work for both streaming and non-streaming clients.
Enable it in `config.yaml`:
```yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 0.9 # semantic matching (requires :semantic image)
cache_ttl: 3600
```
For exact-match only (no extra dependencies):
```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 1.0
```
Check cache performance:
```bash
curl http://localhost:12434/api/cache/stats
```
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
Clear the cache:
```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
**Notes:**
- MOE requests (`moe-*` model prefix) always bypass the cache
- Cache is isolated per `model + system prompt` — different users with different system prompts cannot receive each other's cached responses
- Semantic matching requires the `:semantic` Docker image tag (`ghcr.io/nomyo-ai/nomyo-router:latest-semantic`)
- See [configuration.md](configuration.md#semantic-llm-cache) for all cache options
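The `hit_rate` field in the stats payload is simply `hits / (hits + misses)`. A small helper to compute it client-side might look like this (illustrative, assuming the response shape shown above):

```python
def cache_hit_rate(stats: dict) -> float:
    """Compute hit rate from an /api/cache/stats payload."""
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0

# Using the counters from the sample response above
sample = {"hits": 1547, "misses": 892}
print(round(cache_hit_rate(sample), 3))  # → 0.634
```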
### Token Tracking
The router automatically tracks token usage:
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```
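A quick way to aggregate the per-endpoint breakdown client-side, assuming the response shape shown above:

```python
def totals_by_model(payload: dict) -> dict:
    """Sum input/output tokens per model across all endpoints."""
    totals: dict = {}
    for row in payload["breakdown"]:
        entry = totals.setdefault(row["model"], {"input": 0, "output": 0})
        entry["input"] += row["input_tokens"]
        entry["output"] += row["output_tokens"]
    return totals

payload = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542}
    ],
}
print(totals_by_model(payload))  # → {'llama3': {'input': 120, 'output': 1422}}
```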
### Real-time Monitoring
Use Server-Sent Events to monitor usage in real-time:
```bash
curl http://localhost:12434/api/usage-stream
```
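SSE streams deliver events as `data: <json>` lines. A minimal line parser is sketched below; the `active_connections` field name is only an example payload, not a documented schema:

```python
import json

def parse_sse_data(line: str):
    """Return the decoded JSON payload of an SSE `data:` line, else None."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None

# In practice, iterate over the streaming response body line by line,
# e.g. with requests:
#   for line in resp.iter_lines(decode_unicode=True):
#       usage = parse_sse_data(line)
print(parse_sse_data('data: {"active_connections": 1}'))  # → {'active_connections': 1}
```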
## Integration Examples
### Python Client
```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False,
}

response = requests.post(url, json=data)
print(response.json())
```
### JavaScript Client
```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```
### Streaming with JavaScript
```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```
### Python Ollama Client
```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```
### Python OpenAI Client
```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```
## Best Practices
### 1. Model Selection
- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without version
### 2. Connection Management
- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries
### 3. Error Handling
- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
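Retry with exponential backoff can be as small as the generic sketch below; it is not router-specific, and the parameter defaults are arbitrary:

```python
import time

def with_retries(fn, attempts=4, base_delay=0.5):
    """Call fn(), retrying on any exception with exponential backoff.

    Sleeps base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    and re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Wrap your request call in it, e.g. `with_retries(lambda: requests.post(url, json=data, timeout=30))`.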
### 4. Performance
- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs
## Troubleshooting
### Common Issues
**Problem**: Model not found
- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused
- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency
- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working
- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`
## Examples
See the [examples](examples/) directory for complete integration examples.
### Authentication to NOMYO Router
If a router API key is configured, include it with each request:

- Header: `Authorization: Bearer <router_key>`
- Query parameter: `?api_key=<router_key>`
Example (tags):
```bash
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
```
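From Python, the same key can be attached either way; the small helpers below are shown for illustration only:

```python
def auth_headers(router_key: str) -> dict:
    """Build the Authorization header the router expects."""
    return {"Authorization": f"Bearer {router_key}"}

def auth_params(router_key: str) -> dict:
    """Or pass the key as a query parameter instead."""
    return {"api_key": router_key}

# e.g. requests.get("http://localhost:12434/api/tags",
#                   headers=auth_headers(key))
print(auth_headers("secret"))  # → {'Authorization': 'Bearer secret'}
```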