# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434
max_concurrent_connections: 2
```
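Because `endpoints` is a list, several Ollama instances can be registered at once. A hypothetical multi-endpoint sketch (the second host is a placeholder, not from the project; verify the exact keys against the shipped `config.yaml`):

```yaml
# Hypothetical example: two Ollama instances behind one router.
endpoints:
  - http://localhost:11434
  - http://192.168.1.50:11434   # placeholder second host
max_concurrent_connections: 2
```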
### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.
## API Endpoints

### Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
| `/api/tags` | GET | List available models |
| `/api/ps` | GET | List loaded models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Pull a model |
| `/api/push` | POST | Push a model |
| `/api/create` | POST | Create a model |
| `/api/copy` | POST | Copy a model |
| `/api/delete` | DELETE | Delete a model |
### OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
| `/v1/models` | GET | List models |
### Monitoring Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/usage` | GET | Current connection counts |
| `/api/token_counts` | GET | Token usage statistics |
| `/api/stats` | POST | Detailed model statistics |
| `/api/aggregate_time_series_days` | POST | Aggregate time-series data into daily totals |
| `/api/version` | GET | Ollama version info |
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

- Generates 3 responses from different endpoints
- Generates 3 critiques of those responses
- Selects the best response
- Generates a final refined response
### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```
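Because the response is plain JSON, it is easy to post-process. A minimal Python sketch that groups the breakdown by endpoint (the `summarize_token_counts` helper is illustrative, not part of the router):

```python
def summarize_token_counts(payload):
    """Group input/output token totals by endpoint from /api/token_counts."""
    totals = {}
    for row in payload.get("breakdown", []):
        ep = totals.setdefault(row["endpoint"],
                               {"input_tokens": 0, "output_tokens": 0})
        ep["input_tokens"] += row["input_tokens"]
        ep["output_tokens"] += row["output_tokens"]
    return totals

# Sample payload matching the response shown above.
sample = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542}
    ]
}

print(summarize_token_counts(sample))
# {'http://localhost:11434': {'input_tokens': 120, 'output_tokens': 1422}}
```

In a live setup you would fetch the payload with `requests.get("http://localhost:12434/api/token_counts").json()` instead of the hard-coded sample.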
### Real-time Monitoring

Use Server-Sent Events to monitor usage in real time:

```bash
curl http://localhost:12434/api/usage-stream
```
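SSE frames arrive as `data: <json>` lines. A small Python sketch for decoding them (it assumes each event carries a single JSON object, and the `active_connections` field is a made-up example, not the router's documented schema):

```python
import json

def parse_sse_data(line):
    """Decode one SSE line; return the JSON payload of a 'data:' line, else None."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None  # comments, blank keep-alive lines, etc.

print(parse_sse_data('data: {"active_connections": 1}'))
# {'active_connections': 1}
```

In practice you would feed this function lines from `requests.get(url, stream=True).iter_lines(decode_unicode=True)`.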
## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```
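With `"stream": True`, Ollama-style chat endpoints return one JSON object per line. A hedged sketch of reassembling the full reply (the `message`/`content`/`done` field names follow the Ollama chat format, which the router is assumed to pass through):

```python
import json

def join_stream(lines):
    """Concatenate assistant content from newline-delimited JSON chat chunks."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream chunks in the Ollama chat format.
chunks = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": true}',
]

print(join_stream(chunks))  # Hello!
```

To drive it from the router, pass `stream=True` to `requests.post(...)` and iterate with `response.iter_lines(decode_unicode=True)`.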
### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```
### Streaming with JavaScript

Subscribe to the real-time usage stream with `EventSource`:

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```
### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```
### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```
## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
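The retry advice above can be sketched as a small helper (illustrative only; tune the attempt count and delays for your deployment):

```python
import time

def call_with_backoff(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying failures with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Wrap any router call with it, e.g. `call_with_backoff(lambda: requests.post(url, json=data, timeout=30))`; the injectable `sleep` parameter keeps the helper easy to test.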
### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs
## Troubleshooting

### Common Issues

**Problem:** Model not found

- **Solution:** Ensure the model is pulled on at least one endpoint
- **Check:** `curl http://localhost:12434/api/tags`

**Problem:** Connection refused

- **Solution:** Verify Ollama endpoints are running and accessible
- **Check:** `curl http://localhost:12434/health`

**Problem:** High latency

- **Solution:** Check endpoint health and connection counts
- **Check:** `curl http://localhost:12434/api/usage`

**Problem:** Token tracking not working

- **Solution:** Ensure the database path is writable
- **Check:** `ls -la token_counts.db`
## Examples

See the `examples` directory for complete integration examples.