# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434
max_concurrent_connections: 2
```
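Because `endpoints` is a list, several Ollama instances can be registered at once. A hypothetical multi-endpoint sketch (the second host is a placeholder, not from the project; verify the exact keys against the shipped `config.yaml`):

```yaml
# Hypothetical example: two Ollama instances behind one router.
endpoints:
  - http://localhost:11434
  - http://192.168.1.50:11434   # placeholder second host
max_concurrent_connections: 2
```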
### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.
## API Endpoints

### Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
| `/api/tags` | GET | List available models |
| `/api/ps` | GET | List loaded models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Pull a model |
| `/api/push` | POST | Push a model |
| `/api/create` | POST | Create a model |
| `/api/copy` | POST | Copy a model |
| `/api/delete` | DELETE | Delete a model |
### OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
| `/v1/models` | GET | List models |
### Monitoring Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/usage` | GET | Current connection counts |
| `/api/token_counts` | GET | Token usage statistics |
| `/api/stats` | POST | Detailed model statistics |
| `/api/aggregate_time_series_days` | POST | Aggregate time-series data into daily totals |
| `/api/version` | GET | Ollama version info |
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

- Generates 3 responses from different endpoints
- Generates 3 critiques of those responses
- Selects the best response
- Generates a final refined response
### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```
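Because the response is plain JSON, it is easy to post-process. A minimal Python sketch that groups the breakdown by endpoint (the `summarize_token_counts` helper is illustrative, not part of the router):

```python
def summarize_token_counts(payload):
    """Group input/output token totals by endpoint from /api/token_counts."""
    totals = {}
    for row in payload.get("breakdown", []):
        ep = totals.setdefault(row["endpoint"],
                               {"input_tokens": 0, "output_tokens": 0})
        ep["input_tokens"] += row["input_tokens"]
        ep["output_tokens"] += row["output_tokens"]
    return totals

# Sample payload matching the response shown above.
sample = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542}
    ]
}

print(summarize_token_counts(sample))
# {'http://localhost:11434': {'input_tokens': 120, 'output_tokens': 1422}}
```

In a live setup you would fetch the payload with `requests.get("http://localhost:12434/api/token_counts").json()` instead of the hard-coded sample.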
### Real-time Monitoring

Use Server-Sent Events to monitor usage in real time:

```bash
curl http://localhost:12434/api/usage-stream
```
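SSE frames arrive as `data: <json>` lines. A small Python sketch for decoding them (it assumes each event carries a single JSON object, and the `active_connections` field is a made-up example, not the router's documented schema):

```python
import json

def parse_sse_data(line):
    """Decode one SSE line; return the JSON payload of a 'data:' line, else None."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None  # comments, blank keep-alive lines, etc.

print(parse_sse_data('data: {"active_connections": 1}'))
# {'active_connections': 1}
```

In practice you would feed this function lines from `requests.get(url, stream=True).iter_lines(decode_unicode=True)`.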
## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```
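With `"stream": True`, Ollama-style chat endpoints return one JSON object per line. A hedged sketch of reassembling the full reply (the `message`/`content`/`done` field names follow the Ollama chat format, which the router is assumed to pass through):

```python
import json

def join_stream(lines):
    """Concatenate assistant content from newline-delimited JSON chat chunks."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream chunks in the Ollama chat format.
chunks = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": true}',
]

print(join_stream(chunks))  # Hello!
```

To drive it from the router, pass `stream=True` to `requests.post(...)` and iterate with `response.iter_lines(decode_unicode=True)`.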
### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```
### Streaming with JavaScript

Subscribe to the real-time usage stream with `EventSource`:

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```
### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```
### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```
## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
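The retry advice above can be sketched as a small helper (illustrative only; tune the attempt count and delays for your deployment):

```python
import time

def call_with_backoff(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying failures with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Wrap any router call with it, e.g. `call_with_backoff(lambda: requests.post(url, json=data, timeout=30))`; the injectable `sleep` parameter keeps the helper easy to test.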
### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs
## Troubleshooting

### Common Issues

**Problem:** Model not found

- **Solution:** Ensure the model is pulled on at least one endpoint
- **Check:** `curl http://localhost:12434/api/tags`

**Problem:** Connection refused

- **Solution:** Verify Ollama endpoints are running and accessible
- **Check:** `curl http://localhost:12434/health`

**Problem:** High latency

- **Solution:** Check endpoint health and connection counts
- **Check:** `curl http://localhost:12434/api/usage`

**Problem:** Token tracking not working

- **Solution:** Ensure the database path is writable
- **Check:** `ls -la token_counts.db`
## Examples

See the `examples` directory for complete integration examples.