# NOMYO Router Architecture

## Overview

NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and Ollama backend(s), providing intelligent request routing based on model availability and load balancing.

## Core Components

### 1. Request Routing Engine

Endpoint selection lives in the `choose_endpoint()` function, which applies the following steps in order:

```python
async def choose_endpoint(model: str) -> str:
    """
    Endpoint selection algorithm:
    1. Query all endpoints for advertised models
    2. Filter endpoints that advertise the requested model
    3. Among candidates, find those with the model loaded AND free slots
    4. If none loaded with free slots, pick any with free slots
    5. If all saturated, pick endpoint with lowest current usage
    6. If no endpoint advertises the model, raise error
    """
```
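The selection order in the docstring can be sketched as a pure function over pre-fetched state. This is a minimal illustration, not the router's implementation: `pick_endpoint`, its parameters, the single `max_conn` limit, and the least-busy tie-break within each tier are all assumptions made for the example.

```python
from typing import Dict, List, Set


def pick_endpoint(
    model: str,
    advertised: Dict[str, Set[str]],   # endpoint -> models it advertises
    loaded: Dict[str, Set[str]],       # endpoint -> models currently loaded
    usage: Dict[str, Dict[str, int]],  # endpoint -> model -> active connections
    max_conn: int,                     # assumed limit per endpoint-model pair
) -> str:
    # Steps 1-2: keep only endpoints that advertise the requested model.
    candidates: List[str] = [ep for ep, models in advertised.items() if model in models]
    if not candidates:
        # Step 6: no endpoint advertises the model.
        raise RuntimeError(f"no endpoint advertises model {model!r}")

    def in_use(ep: str) -> int:
        return usage.get(ep, {}).get(model, 0)

    # Step 3: model already loaded and a free slot available.
    warm = [ep for ep in candidates if model in loaded.get(ep, set()) and in_use(ep) < max_conn]
    if warm:
        return min(warm, key=in_use)  # tie-break by load is an assumption

    # Step 4: any endpoint with a free slot (the backend will load the model).
    free = [ep for ep in candidates if in_use(ep) < max_conn]
    if free:
        return min(free, key=in_use)

    # Step 5: everything is saturated, so take the least busy endpoint.
    return min(candidates, key=in_use)
```

Keeping the decision free of I/O makes each tier easy to test in isolation; the real function would first gather the advertised and loaded model sets from the endpoints or from the caches described below.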

### 2. Connection Tracking

The router maintains real-time connection counts per endpoint-model pair:

```python
usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
```

This allows for:

- **Load-aware routing**: Requests are routed to endpoints with available capacity
- **Model-aware routing**: Requests are routed to endpoints where the model is already loaded
- **Efficient resource utilization**: Minimizes model loading/unloading operations
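A minimal sketch of how such counters might be maintained, assuming an `asyncio.Lock` guards the dictionary. The lock name `usage_lock` and the function bodies are illustrative assumptions; only the `usage_counts` structure is taken from the snippet above.

```python
import asyncio
from collections import defaultdict
from typing import Dict

usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
usage_lock = asyncio.Lock()  # hypothetical name for the guarding lock


async def increment_usage(endpoint: str, model: str) -> None:
    """Record one more active connection for the endpoint-model pair."""
    async with usage_lock:
        usage_counts[endpoint][model] += 1


async def decrement_usage(endpoint: str, model: str) -> None:
    """Release one connection and drop entries that fall back to zero."""
    async with usage_lock:
        models = usage_counts.get(endpoint)  # .get() avoids creating entries
        if not models or models.get(model, 0) <= 0:
            return  # nothing to release
        models[model] -= 1
        if models[model] == 0:
            del models[model]
        if not models:
            del usage_counts[endpoint]
```

Removing zeroed entries keeps the structure proportional to the number of pairs that currently have traffic.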

### 3. Caching Layer

Three types of caches improve performance:

- **Models cache** (`_models_cache`): Caches available models per endpoint (300s TTL)
- **Loaded models cache** (`_loaded_models_cache`): Caches currently loaded models (30s TTL)
- **Error cache** (`_error_cache`): Caches transient errors (10s TTL)
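All three caches follow the same time-to-live idea, which can be sketched with one small class. `TTLCache` and its injectable clock are hypothetical helpers for illustration; the router's caches may be plain dictionaries with timestamps.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny time-to-live cache: entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)


# TTLs taken from the list above; the variable names are illustrative.
models_cache = TTLCache(300)
loaded_models_cache = TTLCache(30)
error_cache = TTLCache(10)
```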

### 4. Token Tracking System

Token usage is tracked per endpoint and model in two in-memory buffers:

```python
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
time_series_buffer: list[dict[str, int | str]] = []
```

Features:

- Real-time token counting for input/output tokens
- Periodic flushing to SQLite database (every 10 seconds)
- Time-series data for historical analysis
- Per-endpoint, per-model breakdown
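The buffering side of this can be sketched as accumulate-then-drain. The lock name `buffer_lock` and both helper functions are assumptions for illustration; only the `token_buffer` shape comes from the snippet above.

```python
import asyncio
from collections import defaultdict

token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(
    lambda: defaultdict(lambda: (0, 0))
)
buffer_lock = asyncio.Lock()  # assumed name for the lock guarding the buffer


async def record_tokens(endpoint: str, model: str, input_tokens: int, output_tokens: int) -> None:
    """Add one request's token counts to the in-memory buffer."""
    async with buffer_lock:
        prev_in, prev_out = token_buffer[endpoint][model]
        token_buffer[endpoint][model] = (prev_in + input_tokens, prev_out + output_tokens)


async def drain_buffer() -> list[tuple[str, str, int, int]]:
    """Return buffered rows and reset the buffer; a periodic flush task would call this."""
    async with buffer_lock:
        rows = [
            (endpoint, model, tokens_in, tokens_out)
            for endpoint, models in token_buffer.items()
            for model, (tokens_in, tokens_out) in models.items()
        ]
        token_buffer.clear()
    return rows
```

Draining under the lock and writing to the database afterwards keeps the critical section short, so request handlers are not blocked on disk I/O.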

### 5. API Compatibility Layer

The router supports multiple API formats:

- **Ollama API**: Native `/api/generate`, `/api/chat`, `/api/embed` endpoints
- **OpenAI API**: Compatible `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings` endpoints
- **Transparent conversion**: Responses are converted between formats as needed

## Data Flow

### Request Processing

1. **Ingress**: Frontend sends request to router
2. **Endpoint Selection**: Router determines optimal endpoint
3. **Request Forwarding**: Request sent to selected Ollama endpoint
4. **Response Streaming**: Response streamed back to frontend
5. **Usage Tracking**: Connection and token counts updated
6. **Egress**: Complete response returned to frontend

### Connection Management

```mermaid
sequenceDiagram
    participant Frontend
    participant Router
    participant Endpoint1
    participant Endpoint2

    Frontend->>Router: Request for model X
    Router->>Endpoint1: Check if model X is loaded
    Router->>Endpoint2: Check if model X is loaded
    alt Endpoint1 has model X loaded
        Router->>Endpoint1: Forward request
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    else Endpoint2 has model X loaded
        Router->>Endpoint2: Forward request
        Endpoint2->>Router: Stream response
        Router->>Frontend: Stream response
    else No endpoint has model X loaded
        Router->>Endpoint1: Forward request (will trigger load)
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    end
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

When the user prefixes a model name with `moe-`, the router activates the MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response based on critiques
4. Generates final refined response

### OpenAI Endpoint Support

The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:

- Detects OpenAI endpoints (those containing `/v1`)
- Converts between Ollama and OpenAI response formats
- Handles authentication with API keys
- Maintains consistent behavior across endpoint types
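The detection rule and key handling might look like the sketch below. Both helper names are hypothetical, and the `Bearer` scheme is an assumption based on common OpenAI-compatible practice rather than something this document states.

```python
from typing import Dict


def is_openai_endpoint(endpoint: str) -> bool:
    """Treat any endpoint URL containing '/v1' as OpenAI-compatible."""
    return "/v1" in endpoint


def auth_headers(endpoint: str, api_keys: Dict[str, str]) -> Dict[str, str]:
    """Build the Authorization header for an endpoint, if a key is configured."""
    key = api_keys.get(endpoint)
    # Assumption: keys are sent as bearer tokens.
    return {"Authorization": f"Bearer {key}"} if key else {}
```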

### Reactive Context-Shift

When a backend returns an `exceed_context_size_error` (context window exceeded), the router automatically trims the conversation history and retries rather than surfacing the error to the client.

**How it works:**

1. The error body contains `n_ctx` (the model's context limit) and `n_prompt_tokens` (the actual token count as measured by the backend).
2. `_calibrated_trim_target()` computes a tiktoken-scale trim target using the *delta* between actual tokens and the context limit, correcting for the fact that tiktoken counts fewer tokens than the backend tokeniser does.
3. `_trim_messages_for_context()` implements a sliding-window drop: system messages are always preserved; the oldest non-system messages are evicted first (FIFO) until the estimated token count fits the target. The most recent message is never dropped. After trimming, leading assistant/tool messages are removed to satisfy chat-template requirements (first non-system message must be a user message).
4. Two retry attempts are made:
   - **Retry 1** — trimmed messages, original tool definitions.
   - **Retry 2** — trimmed messages with tool definitions also stripped (handles cases where tool schemas alone consume too many tokens).
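The sliding-window drop in step 3 can be sketched as follows. This is a simplified illustration, not the body of `_trim_messages_for_context()`: the function name `trim_messages`, the character-based token estimate (standing in for tiktoken), and the choice to move system messages to the front are all assumptions.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    # Crude stand-in for a tokenizer: roughly one token per four characters.
    return sum(len(m.get("content", "")) // 4 + 1 for m in messages)


def trim_messages(
    messages: List[Message],
    target_tokens: int,
    count: Callable[[List[Message]], int] = estimate_tokens,
) -> List[Message]:
    # System messages are always kept (this sketch places them first).
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Evict the oldest non-system messages first; never drop the most recent one.
    while len(rest) > 1 and count(system + rest) > target_tokens:
        rest.pop(0)

    # Remove leading assistant/tool messages so the first non-system
    # message is a user message, as chat templates require.
    while len(rest) > 1 and rest[0]["role"] != "user":
        rest.pop(0)

    return system + rest
```

Because the most recent message is never dropped, the result can still exceed the target when that message alone is too large, which is one reason a second retry without tool definitions is useful.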

**Proactive pre-trimming:**

Once a context overflow has been observed for an endpoint/model pair whose `n_ctx` ≤ 32 768, the router records that limit in `_endpoint_nctx`. Subsequent requests to the same pair are pre-trimmed before being sent, avoiding the round-trip to the backend entirely for small-context models.
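A minimal sketch of that bookkeeping, assuming `_endpoint_nctx` is a dictionary keyed by `(endpoint, model)`. The key shape, the constant name, and both helper functions are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

MAX_TRACKED_NCTX = 32_768  # threshold from the text above

_endpoint_nctx: Dict[Tuple[str, str], int] = {}


def record_overflow(endpoint: str, model: str, n_ctx: int) -> None:
    """Remember the context limit reported in an overflow error, for small contexts only."""
    if n_ctx <= MAX_TRACKED_NCTX:
        _endpoint_nctx[(endpoint, model)] = n_ctx


def known_limit(endpoint: str, model: str) -> Optional[int]:
    """Return the recorded limit, or None if the pair has not overflowed yet."""
    return _endpoint_nctx.get((endpoint, model))
```

A request handler would call `known_limit()` before forwarding and pre-trim only when it returns a value.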

### Reactive SSE Push

The `/api/usage-stream` endpoint delivers real-time usage updates using a pub/sub push model rather than client polling.

**Mechanism:**

- `subscribe()` creates a bounded `asyncio.Queue` (capacity 10) and registers it in `_subscribers`.
- Whenever `usage_counts` or `token_usage_counts` change — on every `increment_usage`, `decrement_usage`, or token-worker flush — `_capture_snapshot()` serialises the current state to JSON while the caller still holds the relevant lock, then `_distribute_snapshot()` pushes the snapshot to every registered queue outside the lock.
- If a subscriber's queue is full (slow client), the oldest undelivered snapshot is evicted before the new one is enqueued, so fast producers never block on slow consumers.
- `unsubscribe()` removes the queue when the SSE connection closes; `close_all_sse_queues()` sends a `None` sentinel to all subscribers during router shutdown.

## Performance Considerations

### Concurrency Model

- **Max concurrent connections**: Configurable per endpoint-model pair
- **Connection pooling**: Reuses aiohttp connections
- **Async I/O**: All operations are non-blocking
- **Backpressure handling**: Queues requests when endpoints are saturated

### Caching Strategy

- **Short TTL for loaded models** (30s): Ensures quick detection of model loading/unloading
- **Longer TTL for available models** (300s): Reduces unnecessary API calls
- **Error caching** (10s): Prevents thundering herd during outages

### Memory Management

- **Write-behind pattern**: Token counts buffered in memory, flushed periodically
- **Queue-based SSE**: Bounded per-subscriber queues (capacity 10) with oldest-eviction — see [Reactive SSE Push](#reactive-sse-push)
- **Automatic cleanup**: Zero connection counts are removed from tracking

## Error Handling

### Transient Errors

- Temporary connection failures are cached for 10 seconds
- During cache period, endpoint is treated as unavailable
- After cache expires, endpoint is re-tested

### Permanent Errors

- Invalid model names result in clear error messages
- Missing required fields return 400 Bad Request
- Unreachable endpoints are reported with detailed connection issues

### Health Monitoring

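The `/health` endpoint reports an overall status plus one entry per configured endpoint. Each `status` field is either `"ok"` or `"error"`; an endpoint entry carries `version` when it is healthy and `detail` with the error message otherwise. An illustrative response, assuming the overall status reflects the failing endpoint:

```json
{
  "status": "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok",
      "version": "string"
    },
    "http://endpoint2:11434": {
      "status": "error",
      "detail": "error message"
    }
  }
}
```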

## Database Schema

The router uses SQLite for persistent storage:

```sql
CREATE TABLE token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
);

CREATE TABLE time_series (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model, timestamp)
);
```
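With `(endpoint, model)` as the primary key, a flush can add buffered counts to the existing totals using an upsert. The sketch below uses the standard `sqlite3` module and needs SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`; `flush_counts` and the in-memory database are assumptions for illustration, not the router's storage code.

```python
import sqlite3
from typing import Iterable, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
)
"""


def flush_counts(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str, int, int]]) -> None:
    """Add (endpoint, model, input, output) deltas to the stored totals."""
    conn.executemany(
        """
        INSERT INTO token_counts (endpoint, model, input_tokens, output_tokens, total_tokens)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(endpoint, model) DO UPDATE SET
            input_tokens = input_tokens + excluded.input_tokens,
            output_tokens = output_tokens + excluded.output_tokens,
            total_tokens = total_tokens + excluded.total_tokens
        """,
        [(ep, model, t_in, t_out, t_in + t_out) for ep, model, t_in, t_out in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(SCHEMA)
flush_counts(conn, [("http://endpoint1:11434", "llama3", 10, 20)])
flush_counts(conn, [("http://endpoint1:11434", "llama3", 5, 5)])
```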

## Scaling Considerations

### Horizontal Scaling

- Multiple router instances can run behind a load balancer
- Each instance maintains its own connection tracking
- Instances share no state, so they can be added without coordination; each instance's routing decisions reflect only its own traffic

### Vertical Scaling

- Connection limits can be increased via aiohttp connector settings
- Memory usage grows with number of tracked connections
- Token buffer flushing interval can be adjusted

## Security

### Authentication

- API keys are stored in config.yaml (can use environment variables)
- Keys are passed to endpoints via Authorization headers
- No authentication required for router itself (can be added via middleware)

### Data Protection

- All communication uses TLS when configured
- No sensitive data logged (except in error messages)
- Database contains only token counts and timestamps

## Monitoring and Observability

### Metrics Endpoints

- `/api/usage`: Current connection counts
- `/api/token_counts`: Aggregated token usage
- `/api/stats`: Detailed statistics per model
- `/api/config`: Endpoint configuration and status
- `/api/usage-stream`: Real-time usage updates via SSE

### Logging

- Connection errors are logged with detailed context
- Endpoint selection decisions are logged
- Token counting operations are logged at debug level

## Future Enhancements

Potential areas for improvement:

- Kubernetes operator for automatic deployment
- Prometheus metrics endpoint
- Distributed connection tracking (Redis)
- Request retry logic with exponential backoff
- Circuit breaker pattern for failing endpoints
- Rate limiting per client