Usage Guide

Quick Start

1. Install the Router

git clone https://bitfreedom.net/code/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt

2. Configure Endpoints

Edit config.yaml:

endpoints:
  - http://localhost:11434

max_concurrent_connections: 2

# Optional router-level API key (leave blank to disable)
nomyo-router-api-key: ""

3. Run the Router

uvicorn router:app --host 0.0.0.0 --port 12434

4. Use the Router

Configure your frontend to point to http://localhost:12434 instead of your Ollama instance.

API Endpoints

Ollama-Compatible Endpoints

The router provides all standard Ollama API endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/generate | POST | Generate text |
| /api/chat | POST | Chat completions |
| /api/embed | POST | Embeddings |
| /api/tags | GET | List available models |
| /api/ps | GET | List loaded models |
| /api/show | POST | Show model details |
| /api/pull | POST | Pull a model |
| /api/push | POST | Push a model |
| /api/create | POST | Create a model |
| /api/copy | POST | Copy a model |
| /api/delete | DELETE | Delete a model |

OpenAI-Compatible Endpoints

For OpenAI API compatibility:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/embeddings | POST | Embeddings |
| /v1/models | GET | List models |

Monitoring Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/usage | GET | Current connection counts |
| /api/token_counts | GET | Token usage statistics |
| /api/stats | POST | Detailed model statistics |
| /api/aggregate_time_series_days | POST | Aggregate time-series data into daily buckets |
| /api/version | GET | Ollama version info |
| /api/config | GET | Endpoint configuration |
| /api/usage-stream | GET | Real-time usage updates (SSE) |
| /health | GET | Health check |
| /api/cache/stats | GET | Cache hit/miss counters and config |
| /api/cache/invalidate | POST | Clear all cache entries and counters |

Making Requests

Basic Chat Request

curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'

Streaming Response

curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'

OpenAI API Format

curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Advanced Features

Multiple Opinions Ensemble (MOE)

Prefix your model name with moe- to enable the MOE system:

curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'

The MOE system:

  1. Generates 3 responses from different endpoints
  2. Generates 3 critiques of those responses
  3. Selects the best response
  4. Generates a final refined response
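
The selection in step 3 amounts to a scored argmax over the candidate responses. A minimal sketch in Python (the candidate list, scores, and function name are made up for illustration and are not the router's internal representation):

```python
# Illustrative sketch of the MOE selection step (step 3 above).
# The data shapes and scoring scheme are assumptions for this example.

def select_best(candidates: list[str], critique_scores: list[float]) -> str:
    """Return the candidate response with the highest critique score."""
    best = max(range(len(candidates)), key=lambda i: critique_scores[i])
    return candidates[best]

responses = ["draft from endpoint A", "draft from endpoint B", "draft from endpoint C"]
scores = [0.6, 0.9, 0.7]  # one critique score per draft
print(select_best(responses, scores))  # prints "draft from endpoint B"
```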

Semantic LLM Cache

The router can cache LLM responses and serve them instantly — bypassing endpoint selection, model loading, and token generation entirely. Cached responses work for both streaming and non-streaming clients.

Enable it in config.yaml:

cache_enabled: true
cache_backend: sqlite       # persists across restarts
cache_similarity: 0.9       # semantic matching (requires :semantic image)
cache_ttl: 3600
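
With semantic matching, an incoming prompt is compared against cached prompts and scores at or above cache_similarity count as hits. A minimal cosine-similarity sketch (the vectors here are made up, and the router's actual embedding model and scoring may differ):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

CACHE_SIMILARITY = 0.9  # matches cache_similarity in config.yaml

cached_prompt = [0.1, 0.8, 0.6]     # made-up embedding of a cached prompt
incoming_prompt = [0.12, 0.79, 0.58]  # near-identical rephrasing

score = cosine_similarity(cached_prompt, incoming_prompt)
print(score >= CACHE_SIMILARITY)  # True: the rephrasing clears the 0.9 threshold
```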

For exact-match only (no extra dependencies):

cache_enabled: true
cache_backend: sqlite
cache_similarity: 1.0
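
With cache_similarity: 1.0, lookups reduce to exact matching. Conceptually, the cache key covers the model, the system prompt, and the messages, which is why entries are isolated per model and system prompt. A sketch of such a key derivation (illustrative only, not the router's actual scheme):

```python
import hashlib
import json

def cache_key(model: str, system_prompt: str, messages: list[dict]) -> str:
    """Derive a deterministic key from model + system prompt + messages."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("llama3:latest", "You are terse.", [{"role": "user", "content": "Hi"}])
k2 = cache_key("llama3:latest", "You are verbose.", [{"role": "user", "content": "Hi"}])
print(k1 != k2)  # True: different system prompts can never share an entry
```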

Check cache performance:

curl http://localhost:12434/api/cache/stats
Response:

{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
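
The hit_rate field is simply hits divided by total lookups:

```python
# hit_rate = hits / (hits + misses), using the sample counters above
hits, misses = 1547, 892
hit_rate = hits / (hits + misses)
print(round(hit_rate, 3))  # prints 0.634
```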

Clear the cache:

curl -X POST http://localhost:12434/api/cache/invalidate

Notes:

  • MOE requests (moe-* model prefix) always bypass the cache
  • Cache is isolated per model + system prompt — different users with different system prompts cannot receive each other's cached responses
  • Semantic matching requires the :semantic Docker image tag (ghcr.io/nomyo-ai/nomyo-router:latest-semantic)
  • See configuration.md for all cache options

Token Tracking

The router automatically tracks token usage:

curl http://localhost:12434/api/token_counts

Response:

{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
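
In the sample response the totals are additive: each breakdown entry's input_tokens plus output_tokens equals its total_tokens, and the entry totals sum to the top-level total_tokens. A quick consistency check:

```python
# Verify the arithmetic of the sample /api/token_counts response above.
response = {
    "total_tokens": 1542,
    "breakdown": [
        {"endpoint": "http://localhost:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542},
    ],
}

for entry in response["breakdown"]:
    assert entry["input_tokens"] + entry["output_tokens"] == entry["total_tokens"]

assert sum(e["total_tokens"] for e in response["breakdown"]) == response["total_tokens"]
print("token counts are consistent")
```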

Real-time Monitoring

Use Server-Sent Events to monitor usage in real-time:

curl http://localhost:12434/api/usage-stream
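
Each SSE event arrives as a line of the form data: <json>. A minimal Python parsing helper (the payload shape in the sample line is an assumption for illustration):

```python
import json

def parse_sse_line(line: str):
    """Return the decoded payload of an SSE 'data:' line, or None otherwise."""
    if line.startswith("data: "):
        return json.loads(line[len("data: "):])
    return None  # comments, keepalives, and blank lines carry no payload

# Sample line as it might arrive on /api/usage-stream
sample = 'data: {"http://localhost:11434": 1}'
print(parse_sse_line(sample))  # prints {'http://localhost:11434': 1}
```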

Integration Examples

Python Client

import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())

JavaScript Client

const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);

Streaming with JavaScript

const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};

Python Ollama Client

from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")

Python OpenAI Client

from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")

Best Practices

1. Model Selection

  • Use the same model name across all endpoints
  • For Ollama, append :latest or a specific version tag
  • For OpenAI endpoints, use the model name without version

2. Connection Management

  • Set max_concurrent_connections appropriately for your hardware
  • Monitor /api/usage to ensure endpoints aren't overloaded
  • Consider using the MOE system for critical queries

3. Error Handling

  • Check the /health endpoint regularly
  • Implement retry logic with exponential backoff
  • Monitor error rates and connection failures
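
The retry advice above can be sketched with the standard library; the retry count and delays here are arbitrary example choices:

```python
import json
import time
import urllib.error
import urllib.request

def post_with_retry(url: str, payload: dict, retries: int = 3, base_delay: float = 0.5):
    """POST JSON with exponential backoff: wait 0.5 s, then 1 s, between attempts."""
    body = json.dumps(payload).encode()
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                url, data=body, headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read())
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

# Example (with the router running):
# post_with_retry("http://localhost:12434/api/chat",
#                 {"model": "llama3:latest",
#                  "messages": [{"role": "user", "content": "Hi"}],
#                  "stream": False})
```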

4. Performance

  • Keep frequently used models loaded on multiple endpoints
  • Use streaming for large responses
  • Monitor token usage to optimize costs

Troubleshooting

Common Issues

Problem: Model not found

  • Solution: Ensure the model is pulled on at least one endpoint
  • Check: curl http://localhost:12434/api/tags

Problem: Connection refused

  • Solution: Verify Ollama endpoints are running and accessible
  • Check: curl http://localhost:12434/health

Problem: High latency

  • Solution: Check endpoint health and connection counts
  • Check: curl http://localhost:12434/api/usage

Problem: Token tracking not working

  • Solution: Ensure database path is writable
  • Check: ls -la token_counts.db

Examples

See the examples directory for complete integration examples.

Authentication to NOMYO Router

If a router API key is configured, include it with each request:

  • Header: Authorization: Bearer <router_key>
  • Query: ?api_key=<router_key>

Example (tags):

curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags
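
The same header can be attached from Python with the standard library. A sketch (authed_request is a hypothetical helper name; NOMYO_ROUTER_API_KEY is assumed to hold your configured key):

```python
import os
import urllib.request

def authed_request(path: str, key: str) -> urllib.request.Request:
    """Build a router request carrying the Bearer token."""
    return urllib.request.Request(
        f"http://localhost:12434{path}",
        headers={"Authorization": f"Bearer {key}"},
    )

key = os.environ.get("NOMYO_ROUTER_API_KEY", "example-key")
req = authed_request("/api/tags", key)
# Send with urllib.request.urlopen(req) once the router is running.
print(req.get_header("Authorization"))
```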