nomyo-router/doc/architecture.md
2026-04-14 09:17:33 +02:00

NOMYO Router Architecture

Overview

NOMYO Router is a transparent proxy for Ollama with model-deployment-aware routing. It sits between your frontend application and one or more Ollama backends and routes each request according to which endpoints advertise the requested model, which already have it loaded, and how busy each endpoint currently is.

Core Components

1. Request Routing Engine

The core routing logic lives in the choose_endpoint() function, which picks an endpoint for a given model using the following algorithm:

async def choose_endpoint(model: str) -> str:
    """
    Endpoint selection algorithm:
    1. Query all endpoints for advertised models
    2. Filter endpoints that advertise the requested model
    3. Among candidates, find those with the model loaded AND free slots
    4. If none loaded with free slots, pick any with free slots
    5. If all saturated, pick endpoint with lowest current usage
    6. If no endpoint advertises the model, raise error
    """

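The steps in the docstring can be sketched as a pure function. This is a minimal illustration rather than the router's actual implementation: the name pick_endpoint, its parameters, the max_concurrent slot limit, and the "least busy first" tie-break within each tier are all assumptions made for the example.

```python
from typing import Dict, Set


def pick_endpoint(
    model: str,
    advertised: Dict[str, Set[str]],   # endpoint -> models it advertises
    loaded: Dict[str, Set[str]],       # endpoint -> models currently loaded
    usage: Dict[str, Dict[str, int]],  # endpoint -> model -> active connections
    max_concurrent: int,               # assumed slot limit per endpoint-model pair
) -> str:
    # Steps 2 and 6: only endpoints that advertise the model are candidates.
    candidates = [ep for ep, models in advertised.items() if model in models]
    if not candidates:
        raise RuntimeError(f"no endpoint advertises model {model!r}")

    def active(ep: str) -> int:
        return usage.get(ep, {}).get(model, 0)

    free = [ep for ep in candidates if active(ep) < max_concurrent]

    # Step 3: prefer endpoints that already have the model loaded.
    warm = [ep for ep in free if model in loaded.get(ep, set())]
    if warm:
        return min(warm, key=active)
    # Step 4: otherwise take any endpoint with a free slot.
    if free:
        return min(free, key=active)
    # Step 5: everything is saturated, so fall back to the least busy endpoint.
    return min(candidates, key=active)
```

Keeping the decision free of I/O makes the ordering easy to unit-test; in the router itself the three mappings would come from the caches and usage_counts described in the following sections.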
2. Connection Tracking

The router maintains real-time connection counts per endpoint-model pair:

usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

This allows for:

  • Load-aware routing: Requests are routed to endpoints with available capacity
  • Model-aware routing: Requests are routed to endpoints where the model is already loaded
  • Efficient resource utilization: Minimizes model loading/unloading operations
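A minimal sketch of how such counters might be updated around each request. increment_usage and decrement_usage are named elsewhere in this document, but their signatures, the usage_lock name, and the exact cleanup rule shown here are assumptions.

```python
import asyncio
from collections import defaultdict
from typing import Dict

usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
usage_lock = asyncio.Lock()  # assumed name; serialises updates to usage_counts


async def increment_usage(endpoint: str, model: str) -> None:
    async with usage_lock:
        usage_counts[endpoint][model] += 1


async def decrement_usage(endpoint: str, model: str) -> None:
    async with usage_lock:
        models = usage_counts.get(endpoint)
        if not models or models.get(model, 0) <= 0:
            return  # nothing to release
        models[model] -= 1
        # Drop zero entries so the mapping only holds active connections.
        if models[model] == 0:
            del models[model]
        if not models:
            del usage_counts[endpoint]
```

A request handler would call increment_usage before forwarding and decrement_usage in a finally block, so that a failed or cancelled request still releases its slot.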

3. Caching Layer

Three types of caches improve performance:

  • Models cache (_models_cache): Caches available models per endpoint (300s TTL)
  • Loaded models cache (_loaded_models_cache): Caches currently loaded models (30s TTL)
  • Error cache (_error_cache): Caches transient errors (10s TTL)
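The three caches differ mainly in their TTL, so one small time-based cache can illustrate all of them. The TTLCache class below is a hypothetical sketch, not the router's data structure; only the TTL values are taken from the list above.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny time-based cache; entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: forget it so the caller refetches
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)


# TTLs from the list above; the variable names are illustrative.
models_cache = TTLCache(ttl=300)
loaded_models_cache = TTLCache(ttl=30)
error_cache = TTLCache(ttl=10)
```

Injecting the clock keeps expiry testable without sleeping; a cache miss (None) tells the caller to query the endpoint again and store the fresh result.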

4. Token Tracking System

The router tracks token usage in two in-memory buffers, one for running totals and one for time-series samples:

token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
time_series_buffer: list[dict[str, int | str]] = []

Features:

  • Real-time token counting for input/output tokens
  • Periodic flushing to SQLite database (every 10 seconds)
  • Time-series data for historical analysis
  • Per-endpoint, per-model breakdown
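A sketch of the write-behind idea: counts accumulate in memory and are drained in one step when the flush runs. record_tokens and drain_buffer are hypothetical helper names introduced for this example; only the shape of token_buffer comes from the declaration above.

```python
from collections import defaultdict
from typing import Dict, Tuple

# endpoint -> model -> (input_tokens, output_tokens)
token_buffer: Dict[str, Dict[str, Tuple[int, int]]] = defaultdict(
    lambda: defaultdict(lambda: (0, 0))
)


def record_tokens(endpoint: str, model: str, input_tokens: int, output_tokens: int) -> None:
    """Add one request's counts to the in-memory buffer (hypothetical helper)."""
    prev_in, prev_out = token_buffer[endpoint][model]
    token_buffer[endpoint][model] = (prev_in + input_tokens, prev_out + output_tokens)


def drain_buffer() -> Dict[str, Dict[str, Tuple[int, int]]]:
    """Return a plain-dict snapshot and reset the buffer, ready for a DB flush."""
    snapshot = {ep: dict(models) for ep, models in token_buffer.items()}
    token_buffer.clear()
    return snapshot
```

A background task could call drain_buffer() every 10 seconds and write the snapshot to SQLite, so request handlers never wait on the database.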

5. API Compatibility Layer

The router supports multiple API formats:

  • Ollama API: Native /api/generate, /api/chat, /api/embed endpoints
  • OpenAI API: Compatible /v1/chat/completions, /v1/completions, /v1/embeddings endpoints
  • Transparent conversion: Responses are converted between formats as needed
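To illustrate the conversion direction from Ollama to OpenAI, the sketch below maps a non-streaming /api/chat response onto an OpenAI-style chat completion body. The function name is hypothetical, and passing done_reason straight through as finish_reason is a simplification; the router's own converter may handle more fields and streaming chunks.

```python
import time
import uuid
from typing import Any, Dict


def ollama_chat_to_openai(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Map a non-streaming Ollama /api/chat response to an OpenAI-style body."""
    prompt_tokens = resp.get("prompt_eval_count", 0)
    completion_tokens = resp.get("eval_count", 0)
    message = resp.get("message", {})
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # synthetic id generated locally
        "object": "chat.completion",
        "created": int(time.time()),
        "model": resp.get("model", ""),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": message.get("role", "assistant"),
                    "content": message.get("content", ""),
                },
                # Simplification: reuse Ollama's done_reason as finish_reason.
                "finish_reason": resp.get("done_reason", "stop"),
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```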

Data Flow

Request Processing

  1. Ingress: Frontend sends request to router
  2. Endpoint Selection: Router determines optimal endpoint
  3. Request Forwarding: Request sent to selected Ollama endpoint
  4. Response Streaming: Response streamed back to frontend
  5. Usage Tracking: Connection and token counts updated
  6. Egress: Complete response returned to frontend

Connection Management

sequenceDiagram
    participant Frontend
    participant Router
    participant Endpoint1
    participant Endpoint2

    Frontend->>Router: Request for model X
    Router->>Endpoint1: Check if model X is loaded
    Router->>Endpoint2: Check if model X is loaded
    alt Endpoint1 has model X loaded
        Router->>Endpoint1: Forward request
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    else Endpoint2 has model X loaded
        Router->>Endpoint2: Forward request
        Endpoint2->>Router: Stream response
        Router->>Frontend: Stream response
    else No endpoint has model X loaded
        Router->>Endpoint1: Forward request (will trigger load)
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    end

Advanced Features

Multiple Opinions Ensemble (MOE)

When the user prefixes a model name with moe-, the router activates the MOE system:

  1. Generates 3 responses from different endpoints
  2. Generates 3 critiques of those responses
  3. Selects the best response based on critiques
  4. Generates final refined response
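The four steps can be sketched as a small pipeline. Everything here is illustrative: generate stands in for whatever sends a prompt to an endpoint, the prompt strings are placeholders, and asking the model for the index of the best draft is only one possible way to "select based on critiques", not necessarily the router's.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

Generate = Callable[[str], Awaitable[str]]  # hypothetical: prompt -> completion

MOE_PREFIX = "moe-"


def split_moe_prefix(model: str) -> Tuple[bool, str]:
    """Return (is_moe, base_model) for a requested model name."""
    if model.startswith(MOE_PREFIX):
        return True, model[len(MOE_PREFIX):]
    return False, model


async def moe_answer(prompt: str, generate: Generate, n: int = 3) -> str:
    # 1. Draft n candidate responses concurrently.
    drafts = await asyncio.gather(*(generate(prompt) for _ in range(n)))
    # 2. Critique each draft.
    critiques = await asyncio.gather(
        *(generate(f"Critique this answer:\n{d}") for d in drafts)
    )
    # 3. Pick a winner from the critiques; fall back to the first draft.
    listing = "\n".join(
        f"[{i}] {d}\nCritique: {c}" for i, (d, c) in enumerate(zip(drafts, critiques))
    )
    verdict = await generate(f"Reply with the index of the best answer:\n{listing}")
    digits = "".join(ch for ch in verdict if ch.isdigit())
    index = int(digits) if digits and int(digits) < n else 0
    # 4. Produce the final refined response.
    return await generate(f"Refine this answer:\n{drafts[index]}")
```

Passing generate in as a parameter keeps the sketch independent of how endpoints are chosen; in the router each call would presumably go through endpoint selection so that the drafts come from different endpoints.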

OpenAI Endpoint Support

The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:

  • Detects OpenAI endpoints (those containing /v1)
  • Converts between Ollama and OpenAI response formats
  • Handles authentication with API keys
  • Maintains consistent behavior across endpoint types
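The detection rule and key handling might look like the sketch below. The function names and the api_keys mapping (endpoint URL to key) are assumptions for illustration, and the Bearer scheme is the usual convention for OpenAI-compatible APIs rather than something this document specifies.

```python
from typing import Dict


def is_openai_endpoint(url: str) -> bool:
    """Treat any endpoint whose URL contains /v1 as OpenAI-compatible."""
    return "/v1" in url


def auth_headers(url: str, api_keys: Dict[str, str]) -> Dict[str, str]:
    """Build auth headers for an endpoint; api_keys is a hypothetical mapping."""
    key = api_keys.get(url)
    if is_openai_endpoint(url) and key:
        return {"Authorization": f"Bearer {key}"}
    return {}
```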

Reactive Context-Shift

When a backend returns an exceed_context_size_error (context window exceeded), the router automatically trims the conversation history and retries rather than surfacing the error to the client.

How it works:

  1. The error body contains n_ctx (the model's context limit) and n_prompt_tokens (the actual token count as measured by the backend).
  2. _calibrated_trim_target() computes a tiktoken-scale trim target using the delta between actual tokens and the context limit, correcting for the fact that tiktoken counts fewer tokens than the backend tokeniser does.
  3. _trim_messages_for_context() implements a sliding-window drop: system messages are always preserved; the oldest non-system messages are evicted first (FIFO) until the estimated token count fits the target. The most recent message is never dropped. After trimming, leading assistant/tool messages are removed to satisfy chat-template requirements (first non-system message must be a user message).
  4. Two retry attempts are made:
    • Retry 1 — trimmed messages, original tool definitions.
    • Retry 2 — trimmed messages with tool definitions also stripped (handles cases where tool schemas alone consume too many tokens).
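The sliding-window drop in step 3 can be sketched as follows. This is an illustration under stated assumptions, not the router's _trim_messages_for_context(): the function names are hypothetical, and estimate_tokens is a crude character-based stand-in for a tiktoken count so that the example stays dependency-free.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return sum(len(m.get("content", "")) // 4 + 1 for m in messages)


def trim_messages(
    messages: List[Message],
    target: int,
    count: Callable[[List[Message]], int] = estimate_tokens,
) -> List[Message]:
    kept = list(messages)

    def oldest_droppable() -> int:
        # Oldest non-system message, never considering the most recent message.
        for i, m in enumerate(kept[:-1]):
            if m["role"] != "system":
                return i
        return -1

    # Evict oldest non-system messages first until the estimate fits the target.
    while count(kept) > target:
        i = oldest_droppable()
        if i < 0:
            break
        del kept[i]

    # Chat templates expect the first non-system message to be a user message.
    while True:
        i = oldest_droppable()
        if i < 0 or kept[i]["role"] not in ("assistant", "tool"):
            break
        del kept[i]
    return kept
```

A retry wrapper would call this once with the original tool definitions and, if the backend still reports an overflow, once more with the tool definitions removed.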

Proactive pre-trimming:

Once a context overflow has been observed for an endpoint/model pair whose n_ctx ≤ 32 768, the router records that limit in _endpoint_nctx. Subsequent requests to the same pair are pre-trimmed before being sent, avoiding the round-trip to the backend entirely for small-context models.
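A minimal sketch of that bookkeeping, assuming _endpoint_nctx is a dictionary keyed by (endpoint, model); the key shape and the helper names remember_overflow and known_limit are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

PRETRIM_MAX_NCTX = 32_768  # only small-context models are remembered
_endpoint_nctx: Dict[Tuple[str, str], int] = {}


def remember_overflow(endpoint: str, model: str, n_ctx: int) -> None:
    """Record the context limit reported in an overflow error, if small enough."""
    if n_ctx <= PRETRIM_MAX_NCTX:
        _endpoint_nctx[(endpoint, model)] = n_ctx


def known_limit(endpoint: str, model: str) -> Optional[int]:
    """Return the recorded limit, or None when the pair has never overflowed."""
    return _endpoint_nctx.get((endpoint, model))
```

Before forwarding a request, the router would check known_limit() and pre-trim only when it returns a value.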

Reactive SSE Push

The /api/usage-stream endpoint delivers real-time usage updates using a pub/sub push model rather than client polling.

Mechanism:

  • subscribe() creates a bounded asyncio.Queue (capacity 10) and registers it in _subscribers.
  • Whenever usage_counts or token_usage_counts change — on every increment_usage, decrement_usage, or token-worker flush — _capture_snapshot() serialises the current state to JSON while the caller still holds the relevant lock, then _distribute_snapshot() pushes the snapshot to every registered queue outside the lock.
  • If a subscriber's queue is full (slow client), the oldest undelivered snapshot is evicted before the new one is enqueued, so fast producers never block on slow consumers.
  • unsubscribe() removes the queue when the SSE connection closes; close_all_sse_queues() sends a None sentinel to all subscribers during router shutdown.
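The bounded queue with oldest-eviction can be sketched with asyncio.Queue. The function names follow the ones mentioned above where possible, but their signatures are assumptions, locking is omitted, and snapshots are treated as already-serialised JSON strings.

```python
import asyncio
from typing import Optional, Set

_subscribers: Set["asyncio.Queue[Optional[str]]"] = set()


def subscribe() -> "asyncio.Queue[Optional[str]]":
    """Register a new bounded queue for one SSE client."""
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue(maxsize=10)
    _subscribers.add(queue)
    return queue


def unsubscribe(queue: "asyncio.Queue[Optional[str]]") -> None:
    _subscribers.discard(queue)


def distribute_snapshot(snapshot: str) -> None:
    """Push a snapshot to every subscriber without ever blocking the producer."""
    for queue in list(_subscribers):
        if queue.full():
            queue.get_nowait()  # evict the oldest undelivered snapshot
        queue.put_nowait(snapshot)
```

Because a slow client only ever loses its oldest snapshots, it always catches up to recent state, which is usually what a live usage display needs.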

Performance Considerations

Concurrency Model

  • Max concurrent connections: Configurable per endpoint-model pair
  • Connection pooling: Reuses aiohttp connections
  • Async I/O: All operations are non-blocking
  • Backpressure handling: Queues requests when endpoints are saturated

Caching Strategy

  • Short TTL for loaded models (30s): Ensures quick detection of model loading/unloading
  • Longer TTL for available models (300s): Reduces unnecessary API calls
  • Error caching (10s): Prevents thundering herd during outages

Memory Management

  • Write-behind pattern: Token counts buffered in memory, flushed periodically
  • Queue-based SSE: Bounded per-subscriber queues (capacity 10) with oldest-eviction — see Reactive SSE Push
  • Automatic cleanup: Zero connection counts are removed from tracking

Error Handling

Transient Errors

  • Temporary connection failures are cached for 10 seconds
  • During cache period, endpoint is treated as unavailable
  • After cache expires, endpoint is re-tested

Permanent Errors

  • Invalid model names result in clear error messages
  • Missing required fields return 400 Bad Request
  • Unreachable endpoints are reported with detailed connection issues

Health Monitoring

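The /health endpoint reports an overall status and a status per endpoint. The notation below is schematic: "ok" | "error" lists the possible values, and an endpoint entry carries either a version string or a detail error message, shown here on the healthy and the failing entry respectively:

{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok",
      "version": "string"
    },
    "http://endpoint2:11434": {
      "status": "error",
      "detail": "error message"
    }
  }
}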

Database Schema

The router uses SQLite for persistent storage:

CREATE TABLE token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
);

CREATE TABLE time_series (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model, timestamp)
);
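One way to flush buffered counts into token_counts is an upsert keyed on the table's primary key. This is a sketch rather than the router's actual query: flush_counts and the snapshot shape (endpoint to model to an (input, output) pair) are assumptions, and the ON CONFLICT clause requires SQLite 3.24 or newer.

```python
import sqlite3
from typing import Dict, Tuple

UPSERT = """
INSERT INTO token_counts (endpoint, model, input_tokens, output_tokens, total_tokens)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (endpoint, model) DO UPDATE SET
    input_tokens  = input_tokens  + excluded.input_tokens,
    output_tokens = output_tokens + excluded.output_tokens,
    total_tokens  = total_tokens  + excluded.total_tokens
"""


def flush_counts(
    conn: sqlite3.Connection,
    snapshot: Dict[str, Dict[str, Tuple[int, int]]],
) -> None:
    """Add the buffered (input, output) counts to the running totals."""
    rows = [
        (endpoint, model, tokens_in, tokens_out, tokens_in + tokens_out)
        for endpoint, models in snapshot.items()
        for model, (tokens_in, tokens_out) in models.items()
    ]
    with conn:  # one transaction per flush
        conn.executemany(UPSERT, rows)
```

Writing all rows of a flush in a single transaction keeps the totals consistent if the process stops midway.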

Scaling Considerations

Horizontal Scaling

  • Multiple router instances can run behind a load balancer
  • Each instance maintains its own connection tracking
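  • Instances share no state, so they can be added without coordination; the trade-off is that each instance only sees the connections it routed itself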

Vertical Scaling

  • Connection limits can be increased via aiohttp connector settings
  • Memory usage grows with number of tracked connections
  • Token buffer flushing interval can be adjusted

Security

Authentication

  • API keys are stored in config.yaml (can use environment variables)
  • Keys are passed to endpoints via Authorization headers
  • No authentication required for router itself (can be added via middleware)

Data Protection

  • Communication is encrypted only where TLS is configured
  • No sensitive data is logged, except where it appears in error messages
  • Database contains only token counts and timestamps

Monitoring and Observability

Metrics Endpoints

  • /api/usage: Current connection counts
  • /api/token_counts: Aggregated token usage
  • /api/stats: Detailed statistics per model
  • /api/config: Endpoint configuration and status
  • /api/usage-stream: Real-time usage updates via SSE

Logging

  • Connection errors are logged with detailed context
  • Endpoint selection decisions are logged
  • Token counting operations are logged at debug level

Future Enhancements

Potential areas for improvement:

  • Kubernetes operator for automatic deployment
  • Prometheus metrics endpoint
  • Distributed connection tracking (Redis)
  • Request retry logic with exponential backoff
  • Circuit breaker pattern for failing endpoints
  • Rate limiting per client