# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434
max_concurrent_connections: 2
```

### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.

## API Endpoints

### Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint        | Method | Description           |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST   | Generate text         |
| `/api/chat`     | POST   | Chat completions      |
| `/api/embed`    | POST   | Embeddings            |
| `/api/tags`     | GET    | List available models |
| `/api/ps`       | GET    | List loaded models    |
| `/api/show`     | POST   | Show model details    |
| `/api/pull`     | POST   | Pull a model          |
| `/api/push`     | POST   | Push a model          |
| `/api/create`   | POST   | Create a model        |
| `/api/copy`     | POST   | Copy a model          |
| `/api/delete`   | DELETE | Delete a model        |

### OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint               | Method | Description      |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST   | Chat completions |
| `/v1/completions`      | POST   | Text completions |
| `/v1/embeddings`       | POST   | Embeddings       |
| `/v1/models`           | GET    | List models      |

### Monitoring Endpoints

| Endpoint                          | Method | Description                        |
| --------------------------------- | ------ | ---------------------------------- |
| `/api/usage`                      | GET    | Current connection counts          |
| `/api/token_counts`               | GET    | Token usage statistics             |
| `/api/stats`                      | POST   | Detailed model statistics          |
| `/api/aggregate_time_series_days` | POST   | Aggregate time-series data by day  |
| `/api/version`                    | GET    | Ollama version info                |
| `/api/config`                     | GET    | Endpoint configuration             |
| `/api/usage-stream`               | GET    | Real-time usage updates (SSE)      |
| `/health`                         | GET    | Health check                       |

## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```
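With `"stream": true`, Ollama-style endpoints return newline-delimited JSON, one chunk per line. Here is a minimal Python sketch for consuming the stream through the router, assuming it forwards Ollama's streaming format unchanged (uses the third-party `requests` package):

```python
import json
import requests

# Stream a chat response through the router; each line of the
# response body is a JSON chunk carrying a partial assistant message.
with requests.post(
    "http://localhost:12434/api/chat",
    json={
        "model": "llama3:latest",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
    },
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):  # the final chunk carries "done": true
            print()
            break
```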
### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response

### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```

### Real-time Monitoring

Use Server-Sent Events to monitor usage in real time:

```bash
curl http://localhost:12434/api/usage-stream
```

## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```

### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```

### Streaming with JavaScript

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```

### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```

### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```

## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version tag

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff (see the sketch below)
- Monitor error rates and connection failures
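As a starting point for the retry recommendation above, here is a minimal sketch of exponential backoff around a chat request. The endpoint and payload match the earlier examples; the retry count, delays, and choice of retried exceptions are placeholders to tune for your setup:

```python
import time
import requests

def chat_with_retry(payload, retries=3, base_delay=1.0):
    """POST to the router, backing off exponentially on transient failures."""
    for attempt in range(retries):
        try:
            response = requests.post(
                "http://localhost:12434/api/chat", json=payload, timeout=60
            )
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == retries - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

result = chat_with_retry({
    "model": "llama3:latest",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
})
print(result["message"]["content"])
```

Doubling the delay between attempts gives a briefly overloaded endpoint time to recover instead of hammering it with immediate retries.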
### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs

## Troubleshooting

### Common Issues

**Problem**: Model not found

- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused

- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency

- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working

- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`

The sketch at the end of this guide runs these checks in one pass.

## Examples

See the [examples](examples/) directory for complete integration examples.
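As a starting point for debugging, here is a minimal diagnostic sketch that runs the Troubleshooting checks above in one pass; it assumes the router's default address and uses the `requests` package:

```python
import requests

BASE = "http://localhost:12434"

# Hit each endpoint from the Troubleshooting checklist and report the result.
for path in ("/health", "/api/tags", "/api/usage"):
    try:
        r = requests.get(BASE + path, timeout=5)
        print(f"{path}: {r.status_code} {r.text[:200]}")
    except requests.RequestException as exc:
        print(f"{path}: FAILED ({exc})")
```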