feat: add buffer_lock to prevent race conditions in high-concurrency scenarios; add documentation
This commit is contained in:
parent
434b6d4cca
commit
20a016269d
9 changed files with 2167 additions and 42 deletions
348 doc/usage.md Normal file
@@ -0,0 +1,348 @@
# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434

max_concurrent_connections: 2
```

### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.

## API Endpoints

### Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint        | Method | Description           |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST   | Generate text         |
| `/api/chat`     | POST   | Chat completions      |
| `/api/embed`    | POST   | Embeddings            |
| `/api/tags`     | GET    | List available models |
| `/api/ps`       | GET    | List loaded models    |
| `/api/show`     | POST   | Show model details    |
| `/api/pull`     | POST   | Pull a model          |
| `/api/push`     | POST   | Push a model          |
| `/api/create`   | POST   | Create a model        |
| `/api/copy`     | POST   | Copy a model          |
| `/api/delete`   | DELETE | Delete a model        |

### OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint               | Method | Description      |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST   | Chat completions |
| `/v1/completions`      | POST   | Text completions |
| `/v1/embeddings`       | POST   | Embeddings       |
| `/v1/models`           | GET    | List models      |

### Monitoring Endpoints

| Endpoint                          | Method | Description                                   |
| --------------------------------- | ------ | --------------------------------------------- |
| `/api/usage`                      | GET    | Current connection counts                     |
| `/api/token_counts`               | GET    | Token usage statistics                        |
| `/api/stats`                      | POST   | Detailed model statistics                     |
| `/api/aggregate_time_series_days` | POST   | Aggregate time series data into daily buckets |
| `/api/version`                    | GET    | Ollama version info                           |
| `/api/config`                     | GET    | Endpoint configuration                        |
| `/api/usage-stream`               | GET    | Real-time usage updates (SSE)                 |
| `/health`                         | GET    | Health check                                  |

## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

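With `"stream": true`, the chat endpoint returns newline-delimited JSON chunks rather than a single body (this follows the standard Ollama streaming format; the chunk shape shown here is that format, adjust if the router differs). A minimal Python sketch for assembling the streamed content:

```python
import json

def assemble_stream(lines):
    """Join the content of streamed Ollama-style chat chunks.

    Each line is one JSON object; the final chunk carries "done": true.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Example with two chunks shaped like Ollama's streaming output:
sample = [
    '{"message": {"content": "Once"}, "done": false}',
    '{"message": {"content": " upon a time"}, "done": true}',
]
print(assemble_stream(sample))  # Once upon a time
```
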
### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response

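Conceptually, the steps above form a generate–critique–refine loop. A minimal sketch with stubbed model calls (`generate`, `critique`, and `refine` are illustrative stand-ins, not the router's actual internals):

```python
def moe_pipeline(prompt, generate, critique, refine, n=3):
    """Sketch of the Multiple Opinions Ensemble flow."""
    # 1. Generate n candidate responses (from different endpoints in the router)
    candidates = [generate(prompt) for _ in range(n)]
    # 2. Critique each candidate; higher score = better
    scores = [critique(prompt, c) for c in candidates]
    # 3. Select the best-scored response
    best = candidates[scores.index(max(scores))]
    # 4. Produce a final refined answer from the winner
    return refine(prompt, best)

# Toy stubs to show the data flow:
answers = iter(["draft A", "draft B", "draft C"])
result = moe_pipeline(
    "Explain quantum computing",
    generate=lambda p: next(answers),
    critique=lambda p, c: len(c),        # trivial scoring stub
    refine=lambda p, best: f"refined: {best}",
)
print(result)  # refined: draft A  (all drafts score equally, so the first wins)
```
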
### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```

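As a quick sanity check, the per-endpoint breakdown should sum to the reported grand total, and each entry's input and output tokens should add up to its own total. A small Python sketch over the sample payload above:

```python
sample = {
    "total_tokens": 1542,
    "breakdown": [
        {
            "endpoint": "http://localhost:11434",
            "model": "llama3",
            "input_tokens": 120,
            "output_tokens": 1422,
            "total_tokens": 1542,
        }
    ],
}

# Sum per-endpoint totals and compare against the reported grand total
summed = sum(entry["total_tokens"] for entry in sample["breakdown"])
assert summed == sample["total_tokens"]

# Input + output should also equal each entry's total
for entry in sample["breakdown"]:
    assert entry["input_tokens"] + entry["output_tokens"] == entry["total_tokens"]
```
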
### Real-time Monitoring

Use Server-Sent Events to monitor usage in real time:

```bash
curl http://localhost:12434/api/usage-stream
```

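The same stream can be consumed from Python. SSE frames arrive as `data: <payload>` lines separated by blank lines; a minimal parser sketch (the exact payload shape is an assumption here, adapt it to the router's actual events):

```python
import json

def parse_sse(lines):
    """Yield the JSON payload of each `data:` line in an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Example frames as they might appear on the wire (payload shape assumed):
frames = [
    'data: {"http://localhost:11434": 1}',
    "",
    'data: {"http://localhost:11434": 0}',
]
events = list(parse_sse(frames))
print(events)
```
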
## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```

### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```

### Usage Monitoring with JavaScript (SSE)

Subscribe to the usage stream with `EventSource`:

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```

### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```

### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```

## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version tag

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

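For example, a small helper can flag endpoints that are at their connection limit, given the counts reported by `/api/usage` (the payload shape assumed here, endpoint URL mapped to active connections, is illustrative; adapt it to the actual response):

```python
def overloaded_endpoints(usage, limit=2):
    """Return endpoints whose active connection count has reached `limit`.

    `usage` maps endpoint URL -> active connections, one plausible
    shape for the /api/usage payload (an assumption, not the spec).
    """
    return [url for url, count in usage.items() if count >= limit]

usage = {"http://localhost:11434": 2, "http://localhost:11435": 0}
print(overloaded_endpoints(usage, limit=2))  # ['http://localhost:11434']
```
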
### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures

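A minimal retry-with-exponential-backoff sketch (the delay schedule and retry count are illustrative; `send` stands in for any request function):

```python
import time

def with_retries(send, attempts=4, base_delay=0.5):
    """Call `send()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Stub that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("endpoint unavailable")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # ok (after two retries)
```
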
### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs

## Troubleshooting

### Common Issues

**Problem**: Model not found

- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused

- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency

- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working

- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`

## Examples

See the [examples](examples/) directory for complete integration examples.