NOMYO Router Architecture
Overview
NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and Ollama backend(s), providing intelligent request routing based on model availability and load balancing.
Core Components
1. Request Routing Engine
The router's core intelligence is in the choose_endpoint() function, which implements a sophisticated routing algorithm:
async def choose_endpoint(model: str) -> str:
"""
Endpoint selection algorithm:
1. Query all endpoints for advertised models
2. Filter endpoints that advertise the requested model
3. Among candidates, find those with the model loaded AND free slots
4. If none loaded with free slots, pick any with free slots
5. If all saturated, pick endpoint with lowest current usage
6. If no endpoint advertises the model, raise error
"""
2. Connection Tracking
The router maintains real-time connection counts per endpoint-model pair:
usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
This allows for:
- Load-aware routing: Requests are routed to endpoints with available capacity
- Model-aware routing: Requests are routed to endpoints where the model is already loaded
- Efficient resource utilization: Minimizes model loading/unloading operations
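A sketch of how increment_usage/decrement_usage might maintain this map under an asyncio lock, including the removal of zero counts so the map does not grow unboundedly (the lock name and exact cleanup logic are assumptions):

```python
import asyncio
from collections import defaultdict

usage_counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
_usage_lock = asyncio.Lock()  # guards usage_counts across concurrent requests

async def increment_usage(endpoint: str, model: str) -> None:
    async with _usage_lock:
        usage_counts[endpoint][model] += 1

async def decrement_usage(endpoint: str, model: str) -> None:
    async with _usage_lock:
        usage_counts[endpoint][model] -= 1
        # automatic cleanup: drop zero counts so the map stays small
        if usage_counts[endpoint][model] <= 0:
            del usage_counts[endpoint][model]
            if not usage_counts[endpoint]:
                del usage_counts[endpoint]
```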
3. Caching Layer
Three types of caches improve performance:
- Models cache (_models_cache): Caches available models per endpoint (300s TTL)
- Loaded models cache (_loaded_models_cache): Caches currently loaded models (30s TTL)
- Error cache (_error_cache): Caches transient errors (10s TTL)
4. Token Tracking System
Comprehensive token usage tracking:
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
time_series_buffer: list[dict[str, int | str]] = []
Features:
- Real-time token counting for input/output tokens
- Periodic flushing to SQLite database (every 10 seconds)
- Time-series data for historical analysis
- Per-endpoint, per-model breakdown
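The write-behind buffer can be sketched as two halves: accumulate deltas in memory, then drain them for the periodic SQLite flush. record_tokens and drain_buffer are hypothetical names for these halves:

```python
from collections import defaultdict

# (input_tokens, output_tokens) accumulated per endpoint per model
token_buffer: dict[str, dict[str, tuple[int, int]]] = \
    defaultdict(lambda: defaultdict(lambda: (0, 0)))

def record_tokens(endpoint: str, model: str, n_in: int, n_out: int) -> None:
    cur_in, cur_out = token_buffer[endpoint][model]
    token_buffer[endpoint][model] = (cur_in + n_in, cur_out + n_out)

def drain_buffer() -> list[tuple[str, str, int, int]]:
    """Swap out the buffer and return rows ready for the SQLite flush."""
    rows = [(ep, model, t[0], t[1])
            for ep, models in token_buffer.items()
            for model, t in models.items()]
    token_buffer.clear()
    return rows
```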
5. API Compatibility Layer
The router supports multiple API formats:
- Ollama API: Native /api/generate, /api/chat, and /api/embed endpoints
- OpenAI API: Compatible /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints
- Transparent conversion: Responses are converted between formats as needed
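One direction of that conversion might look like the following sketch, mapping an OpenAI /v1/chat/completions response onto the Ollama /api/chat shape. Field coverage is illustrative, not exhaustive, and this is not the router's actual conversion code:

```python
def openai_to_ollama_chat(resp: dict, model: str) -> dict:
    """Map an OpenAI chat completion onto Ollama's /api/chat response shape."""
    choice = resp["choices"][0]
    usage = resp.get("usage", {})
    return {
        "model": model,
        "message": {
            "role": choice["message"]["role"],
            "content": choice["message"]["content"],
        },
        # OpenAI signals completion via finish_reason; Ollama uses a flag
        "done": choice.get("finish_reason") is not None,
        "prompt_eval_count": usage.get("prompt_tokens", 0),
        "eval_count": usage.get("completion_tokens", 0),
    }
```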
Data Flow
Request Processing
- Ingress: Frontend sends request to router
- Endpoint Selection: Router determines optimal endpoint
- Request Forwarding: Request sent to selected Ollama endpoint
- Response Streaming: Response streamed back to frontend
- Usage Tracking: Connection and token counts updated
- Egress: Complete response returned to frontend
Connection Management
sequenceDiagram
participant Frontend
participant Router
participant Endpoint1
participant Endpoint2
Frontend->>Router: Request for model X
Router->>Endpoint1: Check if model X is loaded
Router->>Endpoint2: Check if model X is loaded
alt Endpoint1 has model X loaded
Router->>Endpoint1: Forward request
Endpoint1->>Router: Stream response
Router->>Frontend: Stream response
else Endpoint2 has model X loaded
Router->>Endpoint2: Forward request
Endpoint2->>Router: Stream response
Router->>Frontend: Stream response
else No endpoint has model X loaded
Router->>Endpoint1: Forward request (will trigger load)
Endpoint1->>Router: Stream response
Router->>Frontend: Stream response
end
Advanced Features
Multiple Opinions Ensemble (MOE)
When the user prefixes a model name with moe-, the router activates the MOE system:
- Generates 3 responses from different endpoints
- Generates 3 critiques of those responses
- Selects the best response based on critiques
- Generates final refined response
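The four steps above might be orchestrated roughly as follows. This is a speculative sketch: generate, critique, and refine are hypothetical stand-ins for calls the router would route to different endpoints, and the real selection logic may differ.

```python
import asyncio

async def moe_generate(prompt: str, generate, critique, refine, n: int = 3) -> str:
    """MOE sketch: draft n candidates, critique each, refine the best."""
    # 1. draft n candidate responses concurrently
    drafts = await asyncio.gather(*(generate(prompt) for _ in range(n)))
    # 2. critique each draft (here: a numeric score per draft)
    scores = await asyncio.gather(*(critique(prompt, d) for d in drafts))
    # 3. keep the best-scoring candidate
    best = drafts[max(range(n), key=scores.__getitem__)]
    # 4. produce the final refined response
    return await refine(prompt, best)
```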
OpenAI Endpoint Support
The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:
- Detects OpenAI endpoints (those containing /v1)
- Converts between Ollama and OpenAI response formats
- Handles authentication with API keys
- Maintains consistent behavior across endpoint types
Reactive Context-Shift
When a backend returns an exceed_context_size_error (context window exceeded), the router automatically trims the conversation history and retries rather than surfacing the error to the client.
How it works:
- The error body contains n_ctx (the model's context limit) and n_prompt_tokens (the actual token count as measured by the backend).
- _calibrated_trim_target() computes a tiktoken-scale trim target using the delta between actual tokens and the context limit, correcting for the fact that tiktoken counts fewer tokens than the backend tokeniser does.
- _trim_messages_for_context() implements a sliding-window drop: system messages are always preserved; the oldest non-system messages are evicted first (FIFO) until the estimated token count fits the target. The most recent message is never dropped. After trimming, leading assistant/tool messages are removed to satisfy chat-template requirements (the first non-system message must be a user message).
- Two retry attempts are made:
- Retry 1 — trimmed messages, original tool definitions.
- Retry 2 — trimmed messages with tool definitions also stripped (handles cases where tool schemas alone consume too many tokens).
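The sliding-window drop can be sketched as follows, with estimate standing in for the tiktoken-based counter (names and bodies are illustrative, not the router's actual code):

```python
def trim_messages_for_context(messages: list[dict], target_tokens: int,
                              estimate) -> list[dict]:
    """Evict oldest non-system messages until estimate() fits the target."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # FIFO eviction: drop oldest first, but never the most recent message.
    while len(rest) > 1 and estimate(system + rest) > target_tokens:
        rest.pop(0)
    # Chat templates expect the first non-system message to come from the
    # user, so strip any leading assistant/tool messages after trimming.
    while rest and rest[0]["role"] in ("assistant", "tool"):
        rest.pop(0)
    return system + rest
```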
Proactive pre-trimming:
Once a context overflow has been observed for an endpoint/model pair whose n_ctx ≤ 32 768, the router records that limit in _endpoint_nctx. Subsequent requests to the same pair are pre-trimmed before being sent, avoiding the round-trip to the backend entirely for small-context models.
Reactive SSE Push
The /api/usage-stream endpoint delivers real-time usage updates using a pub/sub push model rather than client polling.
Mechanism:
- subscribe() creates a bounded asyncio.Queue (capacity 10) and registers it in _subscribers.
- Whenever usage_counts or token_usage_counts change (on every increment_usage, decrement_usage, or token-worker flush), _capture_snapshot() serialises the current state to JSON while the caller still holds the relevant lock, then _distribute_snapshot() pushes the snapshot to every registered queue outside the lock.
- If a subscriber's queue is full (slow client), the oldest undelivered snapshot is evicted before the new one is enqueued, so fast producers never block on slow consumers.
- unsubscribe() removes the queue when the SSE connection closes; close_all_sse_queues() sends a None sentinel to all subscribers during router shutdown.
Performance Considerations
Concurrency Model
- Max concurrent connections: Configurable per endpoint-model pair
- Connection pooling: Reuses aiohttp connections
- Async I/O: All operations are non-blocking
- Backpressure handling: Queues requests when endpoints are saturated
Caching Strategy
- Short TTL for loaded models (30s): Ensures quick detection of model loading/unloading
- Longer TTL for available models (300s): Reduces unnecessary API calls
- Error caching (10s): Prevents thundering herd during outages
Memory Management
- Write-behind pattern: Token counts buffered in memory, flushed periodically
- Queue-based SSE: Bounded per-subscriber queues (capacity 10) with oldest-eviction — see Reactive SSE Push
- Automatic cleanup: Zero connection counts are removed from tracking
Error Handling
Transient Errors
- Temporary connection failures are cached for 10 seconds
- During cache period, endpoint is treated as unavailable
- After cache expires, endpoint is re-tested
Permanent Errors
- Invalid model names result in clear error messages
- Missing required fields return 400 Bad Request
- Unreachable endpoints are reported with detailed connection issues
Health Monitoring
The /health endpoint provides comprehensive health status:
{
"status": "ok" | "error",
"endpoints": {
"http://endpoint1:11434": {
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
}
}
Database Schema
The router uses SQLite for persistent storage:
CREATE TABLE token_counts (
endpoint TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
total_tokens INTEGER NOT NULL,
PRIMARY KEY (endpoint, model)
);
CREATE TABLE time_series (
endpoint TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
total_tokens INTEGER NOT NULL,
timestamp INTEGER NOT NULL,
PRIMARY KEY (endpoint, model, timestamp)
);
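Since token_counts keys on (endpoint, model), the periodic flush can merge buffered deltas with an UPSERT so repeated flushes accumulate. A sketch under that assumption (flush_token_rows is a hypothetical name):

```python
import sqlite3

def flush_token_rows(con: sqlite3.Connection,
                     rows: list[tuple[str, str, int, int]]) -> None:
    """Merge (endpoint, model, input, output) deltas into token_counts."""
    with con:  # one transaction per flush
        con.executemany(
            """
            INSERT INTO token_counts
                (endpoint, model, input_tokens, output_tokens, total_tokens)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT(endpoint, model) DO UPDATE SET
                input_tokens  = input_tokens  + excluded.input_tokens,
                output_tokens = output_tokens + excluded.output_tokens,
                total_tokens  = total_tokens  + excluded.total_tokens
            """,
            [(ep, m, i, o, i + o) for ep, m, i, o in rows],
        )
```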
Scaling Considerations
Horizontal Scaling
- Multiple router instances can run behind a load balancer
- Each instance maintains its own connection tracking
- Stateless design allows for easy scaling
Vertical Scaling
- Connection limits can be increased via aiohttp connector settings
- Memory usage grows with number of tracked connections
- Token buffer flushing interval can be adjusted
Security
Authentication
- API keys are stored in config.yaml (can use environment variables)
- Keys are passed to endpoints via Authorization headers
- No authentication required for router itself (can be added via middleware)
Data Protection
- All communication uses TLS when configured
- No sensitive data logged (except in error messages)
- Database contains only token counts and timestamps
Monitoring and Observability
Metrics Endpoints
- /api/usage: Current connection counts
- /api/token_counts: Aggregated token usage
- /api/stats: Detailed statistics per model
- /api/config: Endpoint configuration and status
- /api/usage-stream: Real-time usage updates via SSE
Logging
- Connection errors are logged with detailed context
- Endpoint selection decisions are logged
- Token counting operations are logged at debug level
Future Enhancements
Potential areas for improvement:
- Kubernetes operator for automatic deployment
- Prometheus metrics endpoint
- Distributed connection tracking (Redis)
- Request retry logic with exponential backoff
- Circuit breaker pattern for failing endpoints
- Rate limiting per client