Usage Guide

Quick Start

1. Install the Router

git clone https://bitfreedom.net/code/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt

2. Configure Endpoints

Edit config.yaml:

endpoints:
  - http://localhost:11434

max_concurrent_connections: 2

# Optional router-level API key (leave blank to disable)
nomyo-router-api-key: ""

3. Run the Router

uvicorn router:app --host 0.0.0.0 --port 12434

4. Use the Router

Configure your frontend to point to http://localhost:12434 instead of your Ollama instance.

API Endpoints

Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/generate | POST | Generate text |
| /api/chat | POST | Chat completions |
| /api/embed | POST | Embeddings |
| /api/tags | GET | List available models |
| /api/ps | GET | List loaded models |
| /api/show | POST | Show model details |
| /api/pull | POST | Pull a model |
| /api/push | POST | Push a model |
| /api/create | POST | Create a model |
| /api/copy | POST | Copy a model |
| /api/delete | DELETE | Delete a model |

OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/embeddings | POST | Embeddings |
| /v1/models | GET | List models |

Monitoring Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/usage | GET | Current connection counts |
| /api/token_counts | GET | Token usage statistics |
| /api/stats | POST | Detailed model statistics |
| /api/aggregate_time_series_days | POST | Aggregate time-series data into daily buckets |
| /api/version | GET | Ollama version info |
| /api/config | GET | Endpoint configuration |
| /api/usage-stream | GET | Real-time usage updates (SSE) |
| /health | GET | Health check |
| /api/cache/stats | GET | Cache hit/miss counters and config |
| /api/cache/invalidate | POST | Clear all cache entries and counters |

Making Requests

Basic Chat Request

curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'

Streaming Response

curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'

OpenAI API Format

curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Advanced Features

Multiple Opinions Ensemble (MOE)

Prefix your model name with moe- to enable the MOE system:

curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'

The MOE system:

  1. Generates 3 responses from different endpoints
  2. Generates 3 critiques of those responses
  3. Selects the best response
  4. Generates a final refined response
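
The selection in step 3 amounts to a scored argmax over the candidate responses. A minimal sketch in Python (the candidate list, scores, and function name are made up for illustration and are not the router's internal representation):

```python
# Illustrative sketch of the MOE selection step (step 3 above).
# The data shapes and scoring scheme are assumptions for this example.

def select_best(candidates: list[str], critique_scores: list[float]) -> str:
    """Return the candidate response with the highest critique score."""
    best = max(range(len(candidates)), key=lambda i: critique_scores[i])
    return candidates[best]

responses = ["draft from endpoint A", "draft from endpoint B", "draft from endpoint C"]
scores = [0.6, 0.9, 0.7]  # one critique score per draft
print(select_best(responses, scores))  # prints "draft from endpoint B"
```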

Semantic LLM Cache

The router can cache LLM responses and serve them instantly — bypassing endpoint selection, model loading, and token generation entirely. Cached responses work for both streaming and non-streaming clients.

Enable it in config.yaml:

cache_enabled: true
cache_backend: sqlite       # persists across restarts
cache_similarity: 0.9       # semantic matching (requires :semantic image)
cache_ttl: 3600
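
With semantic matching, an incoming prompt is compared against cached prompts and scores at or above cache_similarity count as hits. A minimal cosine-similarity sketch (the vectors here are made up, and the router's actual embedding model and scoring may differ):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

CACHE_SIMILARITY = 0.9  # matches cache_similarity in config.yaml

cached_prompt = [0.1, 0.8, 0.6]     # made-up embedding of a cached prompt
incoming_prompt = [0.12, 0.79, 0.58]  # near-identical rephrasing

score = cosine_similarity(cached_prompt, incoming_prompt)
print(score >= CACHE_SIMILARITY)  # True: the rephrasing clears the 0.9 threshold
```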

For exact-match only (no extra dependencies):

cache_enabled: true
cache_backend: sqlite
cache_similarity: 1.0
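
With cache_similarity: 1.0, lookups reduce to exact matching. Conceptually, the cache key covers the model, the system prompt, and the messages, which is why entries are isolated per model and system prompt. A sketch of such a key derivation (illustrative only, not the router's actual scheme):

```python
import hashlib
import json

def cache_key(model: str, system_prompt: str, messages: list[dict]) -> str:
    """Derive a deterministic key from model + system prompt + messages."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("llama3:latest", "You are terse.", [{"role": "user", "content": "Hi"}])
k2 = cache_key("llama3:latest", "You are verbose.", [{"role": "user", "content": "Hi"}])
print(k1 != k2)  # True: different system prompts can never share an entry
```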

Check cache performance:

curl http://localhost:12434/api/cache/stats
Response:

{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
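
The hit_rate field is simply hits divided by total lookups:

```python
# hit_rate = hits / (hits + misses), using the sample counters above
hits, misses = 1547, 892
hit_rate = hits / (hits + misses)
print(round(hit_rate, 3))  # prints 0.634
```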

Clear the cache:

curl -X POST http://localhost:12434/api/cache/invalidate

Notes:

  • MOE requests (moe-* model prefix) always bypass the cache
  • Cache is isolated per model + system prompt — different users with different system prompts cannot receive each other's cached responses
  • Semantic matching requires the :semantic Docker image tag (ghcr.io/nomyo-ai/nomyo-router:latest-semantic)
  • See configuration.md for all cache options

Token Tracking

The router automatically tracks token usage:

curl http://localhost:12434/api/token_counts

Response:

{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
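
In the sample response the totals are additive: each breakdown entry's input_tokens plus output_tokens equals its total_tokens, and the entry totals sum to the top-level total_tokens. A quick consistency check:

```python
# Verify the arithmetic of the sample /api/token_counts response above.
response = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542},
    ],
}

for entry in response["breakdown"]:
    assert entry["input_tokens"] + entry["output_tokens"] == entry["total_tokens"]

assert sum(e["total_tokens"] for e in response["breakdown"]) == response["total_tokens"]
print("token counts are consistent")
```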

Real-time Monitoring

Use Server-Sent Events to monitor usage in real-time:

curl http://localhost:12434/api/usage-stream
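
Each SSE event arrives as a line of the form data: <json>. A minimal Python parsing helper (the payload shape in the sample line is an assumption for illustration):

```python
import json

def parse_sse_line(line: str):
    """Return the decoded payload of an SSE 'data:' line, or None otherwise."""
    if line.startswith("data: "):
        return json.loads(line[len("data: "):])
    return None  # comments, keepalives, and blank lines carry no payload

# Sample line as it might arrive on /api/usage-stream
sample = 'data: {"http://localhost:11434": 1}'
print(parse_sse_line(sample))  # prints {'http://localhost:11434': 1}
```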

Integration Examples

Python Client

import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())

JavaScript Client

const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);

Streaming with JavaScript

const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};

Python Ollama Client

from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")

Python OpenAI Client

from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")

Best Practices

1. Model Selection

  • Use the same model name across all endpoints
  • For Ollama, append :latest or a specific version tag
  • For OpenAI endpoints, use the model name without version

2. Connection Management

  • Set max_concurrent_connections appropriately for your hardware
  • Monitor /api/usage to ensure endpoints aren't overloaded
  • Consider using the MOE system for critical queries

3. Error Handling

  • Check the /health endpoint regularly
  • Implement retry logic with exponential backoff
  • Monitor error rates and connection failures
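
The retry advice above can be sketched with the standard library; the retry count and delays here are arbitrary example choices:

```python
import json
import time
import urllib.error
import urllib.request

def post_with_retry(url: str, payload: dict, retries: int = 3, base_delay: float = 0.5):
    """POST JSON with exponential backoff: wait 0.5 s, then 1 s, between attempts."""
    body = json.dumps(payload).encode()
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                url, data=body, headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read())
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

# Example (with the router running):
# post_with_retry("http://localhost:12434/api/chat",
#                 {"model": "llama3:latest",
#                  "messages": [{"role": "user", "content": "Hi"}],
#                  "stream": False})
```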

4. Performance

  • Keep frequently used models loaded on multiple endpoints
  • Use streaming for large responses
  • Monitor token usage to optimize costs

Troubleshooting

Common Issues

Problem: Model not found

  • Solution: Ensure the model is pulled on at least one endpoint
  • Check: curl http://localhost:12434/api/tags

Problem: Connection refused

  • Solution: Verify Ollama endpoints are running and accessible
  • Check: curl http://localhost:12434/health

Problem: High latency

  • Solution: Check endpoint health and connection counts
  • Check: curl http://localhost:12434/api/usage

Problem: Token tracking not working

  • Solution: Ensure database path is writable
  • Check: ls -la token_counts.db

Examples

See the examples directory for complete integration examples.

Authentication to NOMYO Router

If a router API key is configured, include it with each request:

  • Header: Authorization: Bearer <router_key>
  • Query: ?api_key=<router_key>

Example (tags):

curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
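
The same header can be attached from Python with the standard library. A sketch (authed_request is a hypothetical helper name; NOMYO_ROUTER_API_KEY is assumed to hold your configured key):

```python
import os
import urllib.request

def authed_request(path: str, key: str) -> urllib.request.Request:
    """Build a router request carrying the Bearer token."""
    return urllib.request.Request(
        f"http://localhost:12434{path}",
        headers={"Authorization": f"Bearer {key}"},
    )

key = os.environ.get("NOMYO_ROUTER_API_KEY", "example-key")
req = authed_request("/api/tags", key)
# Send with urllib.request.urlopen(req) once the router is running.
print(req.get_header("Authorization"))
```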