nomyo-ai/nomyo-router


alpha nerd 5ac412eb5c

doc: feature updates

2026-04-14 09:17:33 +02:00

10 KiB


NOMYO Router Architecture

Overview

NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and one or more Ollama backends and routes each request according to which backends have the requested model available and how busy they currently are.

Core Components

1. Request Routing Engine

The core routing logic lives in the choose_endpoint() function, which selects a backend in the following priority order:

async def choose_endpoint(model: str) -> str:
    """
    Endpoint selection algorithm:
    1. Query all endpoints for advertised models
    2. Filter endpoints that advertise the requested model
    3. Among candidates, find those with the model loaded AND free slots
    4. If none loaded with free slots, pick any with free slots
    5. If all saturated, pick endpoint with lowest current usage
    6. If no endpoint advertises the model, raise error
    """

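The selection steps in the docstring can be sketched as a pure function over data the router has already gathered. This is a minimal illustration, not the router's actual code: pick_endpoint, its parameters, the single max_conns limit, and the tie-break by lowest usage within each tier are all assumptions made for the example.

```python
from typing import Dict, List, Set


def pick_endpoint(
    model: str,
    advertised: Dict[str, Set[str]],   # endpoint -> models it advertises
    loaded: Dict[str, Set[str]],       # endpoint -> models currently loaded
    usage: Dict[str, Dict[str, int]],  # endpoint -> model -> active connections
    max_conns: int,                    # hypothetical slot limit per endpoint-model pair
) -> str:
    # Steps 1-2: keep only endpoints that advertise the model.
    candidates: List[str] = [ep for ep, models in advertised.items() if model in models]
    if not candidates:
        # Step 6: no endpoint advertises the model.
        raise RuntimeError(f"no endpoint advertises model {model!r}")

    def in_use(ep: str) -> int:
        return usage.get(ep, {}).get(model, 0)

    def has_free_slot(ep: str) -> bool:
        return in_use(ep) < max_conns

    # Step 3: model already loaded AND a slot is free.
    warm = [ep for ep in candidates if model in loaded.get(ep, set()) and has_free_slot(ep)]
    if warm:
        return min(warm, key=in_use)  # tie-break by usage is an assumption

    # Step 4: any candidate with a free slot (the model gets loaded on demand).
    free = [ep for ep in candidates if has_free_slot(ep)]
    if free:
        return min(free, key=in_use)

    # Step 5: everything is saturated, so take the least busy candidate.
    return min(candidates, key=in_use)
```

Keeping the decision separate from the network calls that populate advertised and loaded makes the priority order easy to unit-test.
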
2. Connection Tracking

The router maintains real-time connection counts per endpoint-model pair:

usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

This allows for:

Load-aware routing: Requests are routed to endpoints with available capacity
Model-aware routing: Requests are routed to endpoints where the model is already loaded
Efficient resource utilization: Minimizes model loading/unloading operations
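A minimal sketch of how such per-pair counters might be maintained around a request. The UsageTracker class and its method names are hypothetical; the document only confirms the shape of usage_counts and that zero counts are removed from tracking.

```python
import asyncio
from collections import defaultdict
from typing import Dict


class UsageTracker:
    """Hypothetical wrapper around an endpoint -> model -> count mapping."""

    def __init__(self) -> None:
        self.counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
        self._lock = asyncio.Lock()

    async def increment(self, endpoint: str, model: str) -> None:
        async with self._lock:
            self.counts[endpoint][model] += 1

    async def decrement(self, endpoint: str, model: str) -> None:
        async with self._lock:
            remaining = self.counts[endpoint][model] - 1
            if remaining > 0:
                self.counts[endpoint][model] = remaining
                return
            # Drop zero entries so the mapping does not grow without bound.
            self.counts[endpoint].pop(model, None)
            if not self.counts[endpoint]:
                self.counts.pop(endpoint, None)
```

A caller would typically increment before forwarding a request and decrement in a finally block, so that a failed or cancelled request still releases its slot.
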

3. Caching Layer

Three types of caches improve performance:

Models cache (_models_cache): Caches available models per endpoint (300s TTL)
Loaded models cache (_loaded_models_cache): Caches currently loaded models (30s TTL)
Error cache (_error_cache): Caches transient errors (10s TTL)
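All three caches follow the same time-to-live idea, which can be sketched as below. TTLCache is a hypothetical helper written for illustration; the router's own cache variables may be plain dictionaries with timestamps.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical TTL cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: force a fresh lookup
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)


# TTLs taken from the list above.
models_cache = TTLCache(300)
loaded_models_cache = TTLCache(30)
error_cache = TTLCache(10)
```
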

4. Token Tracking System

Comprehensive token usage tracking:

token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
time_series_buffer: list[dict[str, int | str]] = []

Features:

Real-time token counting for input/output tokens
Periodic flushing to SQLite database (every 10 seconds)
Time-series data for historical analysis
Per-endpoint, per-model breakdown
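The write-behind behaviour can be illustrated with two small helpers: one that accumulates counts in the buffer and one that drains it for a periodic flush. record_tokens and drain_buffer are hypothetical names; only the shape of token_buffer comes from the declaration above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

token_buffer: Dict[str, Dict[str, Tuple[int, int]]] = defaultdict(
    lambda: defaultdict(lambda: (0, 0))
)


def record_tokens(endpoint: str, model: str, input_tokens: int, output_tokens: int) -> None:
    # Tuples are immutable, so read the running totals and store a new pair.
    prev_in, prev_out = token_buffer[endpoint][model]
    token_buffer[endpoint][model] = (prev_in + input_tokens, prev_out + output_tokens)


def drain_buffer() -> List[Tuple[str, str, int, int]]:
    # Flatten to (endpoint, model, input, output) rows, then reset the buffer.
    rows = [
        (endpoint, model, tokens_in, tokens_out)
        for endpoint, per_model in token_buffer.items()
        for model, (tokens_in, tokens_out) in per_model.items()
    ]
    token_buffer.clear()
    return rows
```

A background task would call drain_buffer() on each flush interval and write the returned rows to the database in one transaction.
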

5. API Compatibility Layer

The router supports multiple API formats:

Ollama API: Native /api/generate, /api/chat, /api/embed endpoints
OpenAI API: Compatible /v1/chat/completions, /v1/completions, /v1/embeddings endpoints
Transparent conversion: Responses are converted between formats as needed
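As an illustration of the conversion direction, the sketch below maps a non-streaming Ollama /api/chat response onto the OpenAI chat-completion shape. The function name is hypothetical and the router's actual mapping may differ, for example in how it builds the id or handles streaming chunks and tool calls.

```python
import time
import uuid
from typing import Any, Dict


def ollama_chat_to_openai(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical converter for a completed (non-streaming) Ollama chat response."""
    prompt_tokens = int(resp.get("prompt_eval_count", 0))
    completion_tokens = int(resp.get("eval_count", 0))
    message = resp.get("message", {})
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # synthetic id, Ollama does not provide one
        "object": "chat.completion",
        "created": int(time.time()),
        "model": resp.get("model", ""),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": message.get("role", "assistant"),
                    "content": message.get("content", ""),
                },
                "finish_reason": resp.get("done_reason") or "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```
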

Data Flow

Request Processing

Ingress: Frontend sends request to router
Endpoint Selection: Router determines optimal endpoint
Request Forwarding: Request sent to selected Ollama endpoint
Response Streaming: Response streamed back to frontend
Usage Tracking: Connection and token counts updated
Egress: Complete response returned to frontend

Connection Management

sequenceDiagram
    participant Frontend
    participant Router
    participant Endpoint1
    participant Endpoint2

    Frontend->>Router: Request for model X
    Router->>Endpoint1: Check if model X is loaded
    Router->>Endpoint2: Check if model X is loaded
    alt Endpoint1 has model X loaded
        Router->>Endpoint1: Forward request
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    else Endpoint2 has model X loaded
        Router->>Endpoint2: Forward request
        Endpoint2->>Router: Stream response
        Router->>Frontend: Stream response
    else No endpoint has model X loaded
        Router->>Endpoint1: Forward request (will trigger load)
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    end

Advanced Features

Multiple Opinions Ensemble (MOE)

When the user prefixes a model name with moe-, the router activates the MOE system:

Generates 3 responses from different endpoints
Generates 3 critiques of those responses
Selects the best response based on critiques
Generates final refined response
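The four stages can be sketched as a small asyncio pipeline. Everything here is an assumption made for illustration: ask stands in for whatever call sends a prompt to an endpoint, score stands in for however the router ranks critiques, and the prompt wording is invented.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Ask = Callable[[str, str], Awaitable[str]]  # (endpoint, prompt) -> completion text


async def moe_answer(
    prompt: str,
    endpoints: Sequence[str],
    ask: Ask,                        # hypothetical backend call
    score: Callable[[str], float],   # hypothetical critique-ranking function
) -> str:
    # Spread the three candidates over the available endpoints.
    picks = [endpoints[i % len(endpoints)] for i in range(3)]

    # 1. Three candidate responses, generated concurrently.
    responses = await asyncio.gather(*(ask(ep, prompt) for ep in picks))

    # 2. One critique per candidate.
    critiques = await asyncio.gather(
        *(ask(ep, f"Critique this answer:\n{r}") for ep, r in zip(picks, responses))
    )

    # 3. Select the candidate whose critique scores highest.
    best = max(range(3), key=lambda i: score(critiques[i]))

    # 4. Ask for a refined final answer based on the winning draft.
    return await ask(
        picks[best],
        f"Question:\n{prompt}\n\nDraft:\n{responses[best]}\n\n"
        f"Critique:\n{critiques[best]}\n\nWrite an improved final answer.",
    )
```

With this structure a single MOE request costs seven backend calls (three responses, three critiques, one refinement), which is worth keeping in mind when sizing connection limits.
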

OpenAI Endpoint Support

The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:

Detects OpenAI endpoints (those containing /v1)
Converts between Ollama and OpenAI response formats
Handles authentication with API keys
Maintains consistent behavior across endpoint types
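The detection rule and key handling described above could look roughly like this. Both helper names are hypothetical, and the Bearer scheme is an assumption based on the usual OpenAI-compatible convention; the document only states that keys travel in the Authorization header.

```python
from typing import Dict


def is_openai_endpoint(url: str) -> bool:
    # Rule from the text: endpoints containing /v1 are treated as OpenAI-compatible.
    return "/v1" in url


def auth_headers(url: str, api_keys: Dict[str, str]) -> Dict[str, str]:
    # api_keys maps endpoint URL -> key (hypothetical shape of the config data).
    key = api_keys.get(url)
    return {"Authorization": f"Bearer {key}"} if key else {}
```
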

Reactive Context-Shift

When a backend returns an exceed_context_size_error (context window exceeded), the router automatically trims the conversation history and retries rather than surfacing the error to the client.

How it works:

The error body contains n_ctx (the model's context limit) and n_prompt_tokens (the actual token count as measured by the backend).
_calibrated_trim_target() computes a tiktoken-scale trim target using the delta between actual tokens and the context limit, correcting for the fact that tiktoken counts fewer tokens than the backend tokeniser does.
_trim_messages_for_context() implements a sliding-window drop: system messages are always preserved; the oldest non-system messages are evicted first (FIFO) until the estimated token count fits the target. The most recent message is never dropped. After trimming, leading assistant/tool messages are removed to satisfy chat-template requirements (first non-system message must be a user message).
Two retry attempts are made:
- Retry 1 — trimmed messages, original tool definitions.
- Retry 2 — trimmed messages with tool definitions also stripped (handles cases where tool schemas alone consume too many tokens).
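The trimming rules can be sketched as follows. This is not the router's implementation: estimate_tokens is a crude character-based stand-in for a tiktoken count, and the calibration formula (scale the backend limit by the ratio of local to backend counts, then keep some headroom) is one plausible reading of the description, with the 10% headroom chosen arbitrarily.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    # Stand-in estimator (assumption): ~4 characters per token plus per-message overhead.
    return sum(len(m.get("content", "")) // 4 + 4 for m in messages)


def calibrated_target(n_ctx: int, n_prompt_tokens: int, local_estimate: int,
                      headroom_pct: int = 10) -> int:
    # Express the backend limit on the local estimator's scale, minus headroom.
    return n_ctx * local_estimate * (100 - headroom_pct) // (n_prompt_tokens * 100)


def trim_messages(
    messages: List[Message],
    target_tokens: int,
    count: Callable[[List[Message]], int] = estimate_tokens,
) -> List[Message]:
    # System messages are always preserved (this sketch moves them to the front).
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Evict the oldest non-system messages first; never drop the most recent one.
    while len(rest) > 1 and count(system + rest) > target_tokens:
        rest.pop(0)

    # Chat templates expect the first non-system message to come from the user.
    while len(rest) > 1 and rest[0]["role"] != "user":
        rest.pop(0)

    return system + rest
```

For example, with n_ctx = 4096, a backend count of 5000 and a local estimate of 4000, calibrated_target returns 2949, so the history is trimmed well below the local count that triggered the overflow.
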

Proactive pre-trimming:

Once a context overflow has been observed for an endpoint/model pair whose n_ctx ≤ 32 768, the router records that limit in _endpoint_nctx. Subsequent requests to the same pair are pre-trimmed before being sent, avoiding the round-trip to the backend entirely for small-context models.
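A minimal sketch of the bookkeeping this implies. The tuple key and the helper names are assumptions; only the _endpoint_nctx name and the 32 768 threshold come from the text.

```python
from typing import Dict, Optional, Tuple

SMALL_CTX_LIMIT = 32_768  # threshold from the text

# (endpoint, model) -> observed context limit; the key shape is an assumption.
_endpoint_nctx: Dict[Tuple[str, str], int] = {}


def remember_overflow(endpoint: str, model: str, n_ctx: int) -> None:
    # Only small-context models are recorded for proactive trimming.
    if n_ctx <= SMALL_CTX_LIMIT:
        _endpoint_nctx[(endpoint, model)] = n_ctx


def known_limit(endpoint: str, model: str) -> Optional[int]:
    # None means no overflow has been seen, so the request is sent untrimmed.
    return _endpoint_nctx.get((endpoint, model))
```
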

Reactive SSE Push

The /api/usage-stream endpoint delivers real-time usage updates using a pub/sub push model rather than client polling.

Mechanism:

subscribe() creates a bounded asyncio.Queue (capacity 10) and registers it in _subscribers.
Whenever usage_counts or token_usage_counts change — on every increment_usage, decrement_usage, or token-worker flush — _capture_snapshot() serialises the current state to JSON while the caller still holds the relevant lock, then _distribute_snapshot() pushes the snapshot to every registered queue outside the lock.
If a subscriber's queue is full (slow client), the oldest undelivered snapshot is evicted before the new one is enqueued, so fast producers never block on slow consumers.
unsubscribe() removes the queue when the SSE connection closes; close_all_sse_queues() sends a None sentinel to all subscribers during router shutdown.

Performance Considerations

Concurrency Model

Max concurrent connections: Configurable per endpoint-model pair
Connection pooling: Reuses aiohttp connections
Async I/O: All operations are non-blocking
Backpressure handling: Queues requests when endpoints are saturated

Caching Strategy

Short TTL for loaded models (30s): Ensures quick detection of model loading/unloading
Longer TTL for available models (300s): Reduces unnecessary API calls
Error caching (10s): Prevents thundering herd during outages

Memory Management

Write-behind pattern: Token counts buffered in memory, flushed periodically
Queue-based SSE: Bounded per-subscriber queues (capacity 10) with oldest-eviction — see Reactive SSE Push
Automatic cleanup: Zero connection counts are removed from tracking

Error Handling

Transient Errors

Temporary connection failures are cached for 10 seconds
During cache period, endpoint is treated as unavailable
After cache expires, endpoint is re-tested

Permanent Errors

Invalid model names result in clear error messages
Missing required fields return 400 Bad Request
Unreachable endpoints are reported with detailed connection issues

Health Monitoring

The /health endpoint provides comprehensive health status:

{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}

Database Schema

The router uses SQLite for persistent storage:

CREATE TABLE token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
);

CREATE TABLE time_series (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model, timestamp)
);
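Given the composite primary key on token_counts, a periodic flush can add buffered counts to the stored totals with an upsert. This is a sketch using the standard sqlite3 module, not the router's actual persistence code; flush_counts and the row shape are hypothetical, and ON CONFLICT ... DO UPDATE requires SQLite 3.24 or newer.

```python
import sqlite3
from typing import Iterable, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
);
"""


def flush_counts(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str, int, int]]) -> None:
    # rows are (endpoint, model, input_tokens, output_tokens) deltas since the last flush.
    conn.executemany(
        """
        INSERT INTO token_counts (endpoint, model, input_tokens, output_tokens, total_tokens)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(endpoint, model) DO UPDATE SET
            input_tokens = input_tokens + excluded.input_tokens,
            output_tokens = output_tokens + excluded.output_tokens,
            total_tokens = total_tokens + excluded.total_tokens
        """,
        [(e, m, i, o, i + o) for e, m, i, o in rows],
    )
    conn.commit()
```
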

Scaling Considerations

Horizontal Scaling

Multiple router instances can run behind a load balancer
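Each instance maintains its own connection tracking, so its routing decisions reflect only the traffic that instance has handled
No state is shared between instances, so instances can be added or removed without coordination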

Vertical Scaling

Connection limits can be increased via aiohttp connector settings
Memory usage grows with number of tracked connections
Token buffer flushing interval can be adjusted

Security

Authentication

API keys are stored in config.yaml (can use environment variables)
Keys are passed to endpoints via Authorization headers
No authentication required for router itself (can be added via middleware)

Data Protection

All communication uses TLS when configured
No sensitive data logged (except in error messages)
Database contains only token counts and timestamps

Monitoring and Observability

Metrics Endpoints

/api/usage: Current connection counts
/api/token_counts: Aggregated token usage
/api/stats: Detailed statistics per model
/api/config: Endpoint configuration and status
/api/usage-stream: Real-time usage updates via SSE

Logging

Connection errors are logged with detailed context
Endpoint selection decisions are logged
Token counting operations are logged at debug level

Future Enhancements

Potential areas for improvement:

Kubernetes operator for automatic deployment
Prometheus metrics endpoint
Distributed connection tracking (Redis)
Request retry logic with exponential backoff
Circuit breaker pattern for failing endpoints
Rate limiting per client
