feat: add buffer_lock to prevent race conditions in high-concurrency scenarios; add documentation
This commit is contained in:
parent
434b6d4cca
commit
20a016269d
9 changed files with 2167 additions and 42 deletions
348 doc/usage.md Normal file
@@ -0,0 +1,348 @@
# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434

max_concurrent_connections: 2
```

### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.

## API Endpoints

### Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint        | Method | Description           |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST   | Generate text         |
| `/api/chat`     | POST   | Chat completions      |
| `/api/embed`    | POST   | Embeddings            |
| `/api/tags`     | GET    | List available models |
| `/api/ps`       | GET    | List loaded models    |
| `/api/show`     | POST   | Show model details    |
| `/api/pull`     | POST   | Pull a model          |
| `/api/push`     | POST   | Push a model          |
| `/api/create`   | POST   | Create a model        |
| `/api/copy`     | POST   | Copy a model          |
| `/api/delete`   | DELETE | Delete a model        |

### OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint               | Method | Description      |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST   | Chat completions |
| `/v1/completions`      | POST   | Text completions |
| `/v1/embeddings`       | POST   | Embeddings       |
| `/v1/models`           | GET    | List models      |

### Monitoring Endpoints

| Endpoint                          | Method | Description                                   |
| --------------------------------- | ------ | --------------------------------------------- |
| `/api/usage`                      | GET    | Current connection counts                     |
| `/api/token_counts`               | GET    | Token usage statistics                        |
| `/api/stats`                      | POST   | Detailed model statistics                     |
| `/api/aggregate_time_series_days` | POST   | Aggregate time series data into daily buckets |
| `/api/version`                    | GET    | Ollama version info                           |
| `/api/config`                     | GET    | Endpoint configuration                        |
| `/api/usage-stream`               | GET    | Real-time usage updates (SSE)                 |
| `/health`                         | GET    | Health check                                  |

## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

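With `"stream": true`, the chat endpoint returns newline-delimited JSON chunks rather than a single body (this follows the standard Ollama streaming format; the chunk shape shown here is that format, adjust if the router differs). A minimal Python sketch for assembling the streamed content:

```python
import json

def assemble_stream(lines):
    """Join the content of streamed Ollama-style chat chunks.

    Each line is one JSON object; the final chunk carries "done": true.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Example with two chunks shaped like Ollama's streaming output:
sample = [
    '{"message": {"content": "Once"}, "done": false}',
    '{"message": {"content": " upon a time"}, "done": true}',
]
print(assemble_stream(sample))  # Once upon a time
```
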
### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response

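Conceptually, the steps above form a generate–critique–refine loop. A minimal sketch with stubbed model calls (`generate`, `critique`, and `refine` are illustrative stand-ins, not the router's actual internals):

```python
def moe_pipeline(prompt, generate, critique, refine, n=3):
    """Sketch of the Multiple Opinions Ensemble flow."""
    # 1. Generate n candidate responses (from different endpoints in the router)
    candidates = [generate(prompt) for _ in range(n)]
    # 2. Critique each candidate; higher score = better
    scores = [critique(prompt, c) for c in candidates]
    # 3. Select the best-scored response
    best = candidates[scores.index(max(scores))]
    # 4. Produce a final refined answer from the winner
    return refine(prompt, best)

# Toy stubs to show the data flow:
answers = iter(["draft A", "draft B", "draft C"])
result = moe_pipeline(
    "Explain quantum computing",
    generate=lambda p: next(answers),
    critique=lambda p, c: len(c),        # trivial scoring stub
    refine=lambda p, best: f"refined: {best}",
)
print(result)  # refined: draft A  (all drafts score equally, so the first wins)
```
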
### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```

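As a quick sanity check, the per-endpoint breakdown should sum to the reported grand total, and each entry's input and output tokens should add up to its own total. A small Python sketch over the sample payload above:

```python
sample = {
    "total_tokens": 1542,
    "breakdown": [
        {
            "endpoint": "http://localhost:11434",
            "model": "llama3",
            "input_tokens": 120,
            "output_tokens": 1422,
            "total_tokens": 1542,
        }
    ],
}

# Sum per-endpoint totals and compare against the reported grand total
summed = sum(entry["total_tokens"] for entry in sample["breakdown"])
assert summed == sample["total_tokens"]

# Input + output should also equal each entry's total
for entry in sample["breakdown"]:
    assert entry["input_tokens"] + entry["output_tokens"] == entry["total_tokens"]
```
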
### Real-time Monitoring

Use Server-Sent Events to monitor usage in real time:

```bash
curl http://localhost:12434/api/usage-stream
```

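The same stream can be consumed from Python. SSE frames arrive as `data: <payload>` lines separated by blank lines; a minimal parser sketch (the exact payload shape is an assumption here, adapt it to the router's actual events):

```python
import json

def parse_sse(lines):
    """Yield the JSON payload of each `data:` line in an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Example frames as they might appear on the wire (payload shape assumed):
frames = [
    'data: {"http://localhost:11434": 1}',
    "",
    'data: {"http://localhost:11434": 0}',
]
events = list(parse_sse(frames))
print(events)
```
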
## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```

### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```

### Usage Monitoring with JavaScript (SSE)

Subscribe to the usage stream with `EventSource`:

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```

### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```

### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```

## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version tag

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

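For example, a small helper can flag endpoints that are at their connection limit, given the counts reported by `/api/usage` (the payload shape assumed here, endpoint URL mapped to active connections, is illustrative; adapt it to the actual response):

```python
def overloaded_endpoints(usage, limit=2):
    """Return endpoints whose active connection count has reached `limit`.

    `usage` maps endpoint URL -> active connections, one plausible
    shape for the /api/usage payload (an assumption, not the spec).
    """
    return [url for url, count in usage.items() if count >= limit]

usage = {"http://localhost:11434": 2, "http://localhost:11435": 0}
print(overloaded_endpoints(usage, limit=2))  # ['http://localhost:11434']
```
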
### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures

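A minimal retry-with-exponential-backoff sketch (the delay schedule and retry count are illustrative; `send` stands in for any request function):

```python
import time

def with_retries(send, attempts=4, base_delay=0.5):
    """Call `send()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Stub that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("endpoint unavailable")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # ok (after two retries)
```
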
### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs

## Troubleshooting

### Common Issues

**Problem**: Model not found

- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused

- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency

- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working

- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`

## Examples

See the [examples](examples/) directory for complete integration examples.