Merge pull request #19 from nomyo-ai/dev-v0.5.x

feat:
buffer_locks preventing race conditions in high concurrency scenarios
documentation folder
Alpha Nerd 2026-01-06 10:51:29 +01:00 committed by GitHub
commit 6828411f95
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
9 changed files with 2167 additions and 42 deletions

137
doc/README.md Normal file

@@ -0,0 +1,137 @@
# NOMYO Router Documentation
Welcome to the NOMYO Router documentation! This folder contains comprehensive guides for using, configuring, and deploying the NOMYO Router.
## Documentation Structure
```
doc/
├── architecture.md # Technical architecture overview
├── configuration.md # Detailed configuration guide
├── usage.md # API usage examples
├── deployment.md # Deployment scenarios
├── monitoring.md # Monitoring and troubleshooting
└── examples/ # Example configurations and scripts
├── docker-compose.yml
├── sample-config.yaml
└── k8s-deployment.yaml
```
## Getting Started
### Quick Start Guide
1. **Install the router**:
```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
2. **Configure endpoints** in `config.yaml`:
```yaml
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
```
3. **Run the router**:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```
4. **Use the router**: Point your frontend to `http://localhost:12434` instead of your Ollama instance.
### Key Features
- **Intelligent Routing**: Model deployment-aware routing with load balancing
- **Multi-Endpoint Support**: Combine Ollama and OpenAI-compatible endpoints
- **Token Tracking**: Comprehensive token usage monitoring
- **Real-time Monitoring**: Server-Sent Events for live usage updates
- **OpenAI Compatibility**: Full OpenAI API compatibility layer
- **MOE System**: Multiple Opinions Ensemble for improved responses with smaller models
## Documentation Guides
### [Architecture](architecture.md)
Learn about the router's internal architecture, routing algorithm, caching mechanisms, and advanced features like the MOE system.
### [Configuration](configuration.md)
Detailed guide on configuring the router with multiple endpoints, API keys, and environment variables.
### [Usage](usage.md)
Comprehensive API reference with examples for making requests, streaming responses, and using advanced features.
### [Deployment](deployment.md)
Step-by-step deployment guides for bare metal, Docker, Kubernetes, and production environments.
### [Monitoring](monitoring.md)
Monitoring endpoints, troubleshooting guides, performance tuning, and best practices for maintaining your router.
## Examples
The [examples](examples/) directory contains ready-to-use configuration files:
- **docker-compose.yml**: Complete Docker Compose setup with multiple Ollama instances
- **sample-config.yaml**: Example configuration with comments
- **k8s-deployment.yaml**: Kubernetes deployment manifests
## Need Help?
### Common Issues
Check the [Monitoring Guide](monitoring.md) for troubleshooting common problems:
- Endpoint unavailable
- Model not found
- High latency
- Connection limits reached
- Token tracking issues
### Support
For additional help:
1. Check the [GitHub Issues](https://github.com/nomyo-ai/nomyo-router/issues)
2. Review the [Monitoring Guide](monitoring.md) for diagnostics
3. Examine the router logs for detailed error messages
## Best Practices
### Configuration
- Use environment variables for API keys
- Set appropriate `max_concurrent_connections` based on your hardware
- Monitor endpoint health regularly
- Keep models loaded on multiple endpoints for redundancy
### Deployment
- Use Docker for containerized deployments
- Consider Kubernetes for production at scale
- Set up monitoring and alerting
- Implement regular backups of the token counts database
### Performance
- Balance load across multiple endpoints
- Keep frequently used models loaded
- Monitor connection counts and token usage
- Scale horizontally when needed
## Next Steps
1. **Read the [Architecture Guide](architecture.md)** to understand how the router works
2. **Configure your endpoints** in `config.yaml`
3. **Deploy the router** using your preferred method
4. **Monitor your setup** using the monitoring endpoints
5. **Scale as needed** by adding more endpoints
Happy routing! 🚀

259
doc/architecture.md Normal file

@@ -0,0 +1,259 @@
# NOMYO Router Architecture
## Overview
NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and Ollama backend(s), providing intelligent request routing based on model availability and load balancing.
## Core Components
### 1. Request Routing Engine
The router's core intelligence is in the `choose_endpoint()` function, which implements a sophisticated routing algorithm:
```python
async def choose_endpoint(model: str) -> str:
"""
Endpoint selection algorithm:
1. Query all endpoints for advertised models
2. Filter endpoints that advertise the requested model
3. Among candidates, find those with the model loaded AND free slots
4. If none loaded with free slots, pick any with free slots
5. If all saturated, pick endpoint with lowest current usage
6. If no endpoint advertises the model, raise error
"""
```
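The steps above can be sketched in plain Python. This is a simplified, synchronous illustration, not the router's actual implementation: `MAX_CONN` and the dictionaries passed in are assumptions standing in for the async endpoint queries and the `max_concurrent_connections` setting.

```python
MAX_CONN = 2  # stands in for max_concurrent_connections

def choose_endpoint(model, advertised, loaded, usage):
    """advertised/loaded: endpoint -> set of model names; usage: endpoint -> active connections."""
    candidates = [ep for ep, models in advertised.items() if model in models]
    if not candidates:
        raise ValueError(f"No configured endpoint advertises model {model!r}")
    # 3. Prefer endpoints where the model is already loaded AND a slot is free
    warm = [ep for ep in candidates
            if model in loaded.get(ep, set()) and usage.get(ep, 0) < MAX_CONN]
    if warm:
        return warm[0]
    # 4. Otherwise, any candidate with a free slot (loading the model there)
    free = [ep for ep in candidates if usage.get(ep, 0) < MAX_CONN]
    if free:
        return free[0]
    # 5. All saturated: pick the candidate with the lowest current usage
    return min(candidates, key=lambda ep: usage.get(ep, 0))
```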
### 2. Connection Tracking
The router maintains real-time connection counts per endpoint-model pair:
```python
from collections import defaultdict
from typing import Dict

# Live connection count per endpoint URL, per model name
usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
```
This allows for:
- **Load-aware routing**: Requests are routed to endpoints with available capacity
- **Model-aware routing**: Requests are routed to endpoints where the model is already loaded
- **Efficient resource utilization**: Minimizes model loading/unloading operations
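As an illustrative sketch (the `acquire`/`release` helpers are hypothetical names, not the router's API), the counter is incremented when a request starts and decremented when it completes, with zero counts cleaned up:

```python
from collections import defaultdict

usage_counts = defaultdict(lambda: defaultdict(int))

def acquire(endpoint: str, model: str) -> None:
    # Called when a request is routed to this endpoint-model pair
    usage_counts[endpoint][model] += 1

def release(endpoint: str, model: str) -> None:
    # Called when the response completes; zero counts are removed
    usage_counts[endpoint][model] -= 1
    if usage_counts[endpoint][model] <= 0:
        del usage_counts[endpoint][model]
```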
### 3. Caching Layer
Three types of caches improve performance:
- **Models cache** (`_models_cache`): Caches available models per endpoint (300s TTL)
- **Loaded models cache** (`_loaded_models_cache`): Caches currently loaded models (30s TTL)
- **Error cache** (`_error_cache`): Caches transient errors (10s TTL)
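A minimal sketch of the TTL pattern these caches share (the class name and structure are illustrative; the router's caches differ in detail):

```python
import time

class TTLCache:
    """Entries expire ttl seconds after being set."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # missing or expired
        return entry[1]

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

# e.g. models_cache = TTLCache(300), loaded_cache = TTLCache(30), error_cache = TTLCache(10)
```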
### 4. Token Tracking System
Comprehensive token usage tracking:
```python
from collections import defaultdict

# Buffered (input_tokens, output_tokens) per endpoint, per model
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
# Time-series records awaiting the next periodic flush
time_series_buffer: list[dict[str, int | str]] = []
```
Features:
- Real-time token counting for input/output tokens
- Periodic flushing to SQLite database (every 10 seconds)
- Time-series data for historical analysis
- Per-endpoint, per-model breakdown
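The buffer-then-flush cycle can be sketched as follows. This is an illustration of the write-behind pattern against the documented `token_counts` schema; `record` and `flush` are hypothetical names, and the real router flushes on a 10-second timer rather than on demand.

```python
import sqlite3
from collections import defaultdict

token_buffer = defaultdict(lambda: defaultdict(lambda: (0, 0)))

def record(endpoint: str, model: str, n_in: int, n_out: int) -> None:
    # Cheap in-memory accumulation on the request path
    i, o = token_buffer[endpoint][model]
    token_buffer[endpoint][model] = (i + n_in, o + n_out)

def flush(conn: sqlite3.Connection) -> None:
    # Periodic write-behind: merge buffered counts into SQLite, then clear
    for endpoint, models in token_buffer.items():
        for model, (i, o) in models.items():
            conn.execute(
                """INSERT INTO token_counts (endpoint, model, input_tokens, output_tokens, total_tokens)
                   VALUES (?, ?, ?, ?, ?)
                   ON CONFLICT(endpoint, model) DO UPDATE SET
                     input_tokens = input_tokens + excluded.input_tokens,
                     output_tokens = output_tokens + excluded.output_tokens,
                     total_tokens = total_tokens + excluded.total_tokens""",
                (endpoint, model, i, o, i + o),
            )
    token_buffer.clear()
    conn.commit()
```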
### 5. API Compatibility Layer
The router supports multiple API formats:
- **Ollama API**: Native `/api/generate`, `/api/chat`, `/api/embed` endpoints
- **OpenAI API**: Compatible `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings` endpoints
- **Transparent conversion**: Responses are converted between formats as needed
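To show the general shape of such a conversion, here is a sketch mapping an Ollama `/api/chat` response onto the OpenAI `/v1/chat/completions` shape. The field mapping is an assumption for illustration, not the router's exact conversion code.

```python
import time
import uuid

def ollama_chat_to_openai(resp: dict, model: str) -> dict:
    """Illustrative conversion: Ollama chat response -> OpenAI chat completion."""
    prompt_toks = resp.get("prompt_eval_count", 0)
    completion_toks = resp.get("eval_count", 0)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": resp.get("message", {}),
            "finish_reason": "stop" if resp.get("done") else None,
        }],
        "usage": {
            "prompt_tokens": prompt_toks,
            "completion_tokens": completion_toks,
            "total_tokens": prompt_toks + completion_toks,
        },
    }
```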
## Data Flow
### Request Processing
1. **Ingress**: Frontend sends request to router
2. **Endpoint Selection**: Router determines optimal endpoint
3. **Request Forwarding**: Request sent to selected Ollama endpoint
4. **Response Streaming**: Response streamed back to frontend
5. **Usage Tracking**: Connection and token counts updated
6. **Egress**: Complete response returned to frontend
### Connection Management
```mermaid
sequenceDiagram
    participant Frontend
    participant Router
    participant Endpoint1
    participant Endpoint2

    Frontend->>Router: Request for model X
    Router->>Endpoint1: Check if model X is loaded
    Router->>Endpoint2: Check if model X is loaded
    alt Endpoint1 has model X loaded
        Router->>Endpoint1: Forward request
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    else Endpoint2 has model X loaded
        Router->>Endpoint2: Forward request
        Endpoint2->>Router: Stream response
        Router->>Frontend: Stream response
    else No endpoint has model X loaded
        Router->>Endpoint1: Forward request (will trigger load)
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    end
```
## Advanced Features
### Multiple Opinions Ensemble (MOE)
When the user prefixes a model name with `moe-`, the router activates the MOE system:
1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response based on critiques
4. Generates final refined response
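The four steps above can be expressed as a small skeleton. The `generate`, `critique`, and `refine` callables are stand-ins for requests to different endpoints; this is a sketch of the flow, not the router's MOE code.

```python
def moe(prompt, generate, critique, refine, n=3):
    """Multiple Opinions Ensemble skeleton: draft n responses, score them
    via critiques, refine the best-scored draft."""
    drafts = [generate(prompt, i) for i in range(n)]          # step 1
    scores = [critique(prompt, d) for d in drafts]            # step 2
    best = drafts[max(range(n), key=lambda i: scores[i])]     # step 3
    return refine(prompt, best)                               # step 4
```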
### OpenAI Endpoint Support
The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:
- Detects OpenAI endpoints (those containing `/v1`)
- Converts between Ollama and OpenAI response formats
- Handles authentication with API keys
- Maintains consistent behavior across endpoint types
## Performance Considerations
### Concurrency Model
- **Max concurrent connections**: Configurable per endpoint-model pair
- **Connection pooling**: Reuses aiohttp connections
- **Async I/O**: All operations are non-blocking
- **Backpressure handling**: Queues requests when endpoints are saturated
### Caching Strategy
- **Short TTL for loaded models** (30s): Ensures quick detection of model loading/unloading
- **Longer TTL for available models** (300s): Reduces unnecessary API calls
- **Error caching** (10s): Prevents thundering herd during outages
### Memory Management
- **Write-behind pattern**: Token counts buffered in memory, flushed periodically
- **Queue-based SSE**: Server-Sent Events use bounded queues to prevent memory bloat
- **Automatic cleanup**: Zero connection counts are removed from tracking
## Error Handling
### Transient Errors
- Temporary connection failures are cached for 10 seconds
- During cache period, endpoint is treated as unavailable
- After cache expires, endpoint is re-tested
### Permanent Errors
- Invalid model names result in clear error messages
- Missing required fields return 400 Bad Request
- Unreachable endpoints are reported with detailed connection issues
### Health Monitoring
The `/health` endpoint provides comprehensive health status:
```json
{
"status": "ok" | "error",
"endpoints": {
"http://endpoint1:11434": {
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
}
}
```
## Database Schema
The router uses SQLite for persistent storage:
```sql
CREATE TABLE token_counts (
endpoint TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
total_tokens INTEGER NOT NULL,
PRIMARY KEY (endpoint, model)
);
CREATE TABLE time_series (
endpoint TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
total_tokens INTEGER NOT NULL,
timestamp INTEGER NOT NULL,
PRIMARY KEY (endpoint, model, timestamp)
);
```
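As a quick sketch, the `token_counts` table can be created and queried for aggregate totals (Python's built-in `sqlite3`, using the example numbers from the monitoring endpoints):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
)""")
conn.execute("INSERT INTO token_counts VALUES (?, ?, ?, ?, ?)",
             ("http://endpoint1:11434", "llama3", 120, 1422, 1542))
conn.execute("INSERT INTO token_counts VALUES (?, ?, ?, ?, ?)",
             ("http://endpoint1:11434", "mistral", 80, 796, 876))

# Aggregate total across all endpoint-model pairs
total = conn.execute("SELECT SUM(total_tokens) FROM token_counts").fetchone()[0]
```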
## Scaling Considerations
### Horizontal Scaling
- Multiple router instances can run behind a load balancer
- Each instance maintains its own connection tracking
- Stateless design allows for easy scaling
### Vertical Scaling
- Connection limits can be increased via aiohttp connector settings
- Memory usage grows with number of tracked connections
- Token buffer flushing interval can be adjusted
## Security
### Authentication
- API keys are stored in config.yaml (can use environment variables)
- Keys are passed to endpoints via Authorization headers
- No authentication required for router itself (can be added via middleware)
### Data Protection
- All communication uses TLS when configured
- No sensitive data logged (except in error messages)
- Database contains only token counts and timestamps
## Monitoring and Observability
### Metrics Endpoints
- `/api/usage`: Current connection counts
- `/api/token_counts`: Aggregated token usage
- `/api/stats`: Detailed statistics per model
- `/api/config`: Endpoint configuration and status
- `/api/usage-stream`: Real-time usage updates via SSE
### Logging
- Connection errors are logged with detailed context
- Endpoint selection decisions are logged
- Token counting operations are logged at debug level
## Future Enhancements
Potential areas for improvement:
- Kubernetes operator for automatic deployment
- Prometheus metrics endpoint
- Distributed connection tracking (Redis)
- Request retry logic with exponential backoff
- Circuit breaker pattern for failing endpoints
- Rate limiting per client

197
doc/configuration.md Normal file

@@ -0,0 +1,197 @@
# Configuration Guide
## Configuration File
The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys.
### Basic Configuration
```yaml
# config.yaml
endpoints:
- http://localhost:11434
- http://ollama-server:11434
# Maximum concurrent connections *per endpoint-model pair*
max_concurrent_connections: 2
```
### Complete Example
```yaml
# config.yaml
endpoints:
- http://192.168.0.50:11434
- http://192.168.0.51:11434
- http://192.168.0.52:11434
- https://api.openai.com/v1
# Maximum concurrent connections *per endpoint-model pair* (corresponds to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2
# API keys for remote endpoints.
# Set an environment variable such as OPENAI_KEY before startup.
# The URLs here must exactly match those in the endpoints block.
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
## Configuration Options
### `endpoints`
**Type**: `list[str]`
**Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`).
**Examples**:
```yaml
endpoints:
- http://localhost:11434
- http://ollama1:11434
- http://ollama2:11434
- https://api.openai.com/v1
- https://api.anthropic.com/v1
```
**Notes**:
- Ollama endpoints use the standard `/api/` prefix
- OpenAI-compatible endpoints use `/v1` prefix
- The router automatically detects endpoint type based on URL pattern
### `max_concurrent_connections`
**Type**: `int`
**Default**: `1`
**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting.
**Example**:
```yaml
max_concurrent_connections: 4
```
**Notes**:
- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage
### `api_keys`
**Type**: `dict[str, str]`
**Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints.
**Example**:
```yaml
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
**Environment Variables**:
- API keys can reference environment variables using `${VAR_NAME}` syntax
- The router will expand these references at startup
- Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable
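A minimal sketch of this expansion (the function name is illustrative; the router's actual expansion code may differ):

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str) -> str:
    """Replace ${VAR_NAME} references with environment variable values."""
    return _ENV_REF.sub(lambda m: os.environ.get(m.group(1), ""), value)
```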
## Environment Variables
### `NOMYO_ROUTER_CONFIG_PATH`
**Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory.
**Example**:
```bash
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
```
### `NOMYO_ROUTER_DB_PATH`
**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory.
**Example**:
```bash
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
```
### API-Specific Keys
You can set API keys directly as environment variables:
```bash
export OPENAI_KEY=your_openai_api_key
export ANTHROPIC_KEY=your_anthropic_api_key
```
## Configuration Best Practices
### Multiple Ollama Instances
For a cluster of Ollama instances:
```yaml
endpoints:
- http://ollama-worker1:11434
- http://ollama-worker2:11434
- http://ollama-worker3:11434
max_concurrent_connections: 2
```
**Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting.
### Mixed Endpoints
Combining Ollama and OpenAI endpoints:
```yaml
endpoints:
- http://localhost:11434
- https://api.openai.com/v1
api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
**Note**: The router will automatically route requests based on model availability across all endpoints.
### High Availability
For production deployments:
```yaml
endpoints:
- http://ollama-primary:11434
- http://ollama-secondary:11434
- http://ollama-tertiary:11434
max_concurrent_connections: 3
```
**Recommendation**: Use multiple endpoints for redundancy and load distribution.
## Configuration Validation
The router validates the configuration at startup:
1. **Endpoint URLs**: Must be valid URLs
2. **API Keys**: Must be strings (can reference environment variables)
3. **Connection Limits**: Must be positive integers
If the configuration is invalid, the router will exit with an error message.
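The three documented checks can be sketched like this (an illustrative validator, not the router's actual code; `validate_config` is a hypothetical name):

```python
from urllib.parse import urlparse

def validate_config(cfg: dict) -> list[str]:
    """Return a list of configuration problems; empty means valid."""
    errors = []
    # 1. Endpoint URLs must be valid
    for url in cfg.get("endpoints", []):
        parsed = urlparse(str(url))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"invalid endpoint URL: {url}")
    # 3. Connection limits must be positive integers
    mcc = cfg.get("max_concurrent_connections", 1)
    if not isinstance(mcc, int) or mcc < 1:
        errors.append("max_concurrent_connections must be a positive integer")
    # 2. API keys must be strings (env var references are expanded elsewhere)
    for ep, key in cfg.get("api_keys", {}).items():
        if not isinstance(key, str):
            errors.append(f"api key for {ep} must be a string")
    return errors
```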
## Dynamic Configuration
The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider:
1. Using a configuration management system
2. Implementing a rolling restart strategy
3. Using environment variables for sensitive data
## Example Configurations
See the [examples](examples/) directory for ready-to-use configuration examples.

444
doc/deployment.md Normal file

@@ -0,0 +1,444 @@
# Deployment Guide
## Deployment Options
NOMYO Router can be deployed in various environments depending on your requirements.
## 1. Bare Metal / VM Deployment
### Prerequisites
- Python 3.10+
- pip
- Virtual environment (recommended)
### Installation
```bash
# Clone the repository
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
# Create virtual environment
python3 -m venv .venv/router
source .venv/router/bin/activate
# Install dependencies
pip3 install -r requirements.txt
# Configure endpoints
nano config.yaml
```
### Running the Router
```bash
# Basic startup
uvicorn router:app --host 0.0.0.0 --port 12434
# With custom configuration path
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
uvicorn router:app --host 0.0.0.0 --port 12434
# With custom database path
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
uvicorn router:app --host 0.0.0.0 --port 12434
```
### Systemd Service
Create `/etc/systemd/system/nomyo-router.service`:
```ini
[Unit]
Description=NOMYO Router - Ollama Proxy
After=network.target
[Service]
User=nomyo
Group=nomyo
WorkingDirectory=/opt/nomyo-router
Environment="NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml"
Environment="NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db"
ExecStart=/opt/nomyo-router/.venv/router/bin/uvicorn router:app --host 0.0.0.0 --port 12434
Restart=always
RestartSec=10
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nomyo-router
[Install]
WantedBy=multi-user.target
```
Enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable nomyo-router
sudo systemctl start nomyo-router
sudo systemctl status nomyo-router
```
## 2. Docker Deployment
### Build the Image
```bash
docker build -t nomyo-router .
```
### Run the Container
```bash
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /absolute/path/to/config_folder:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
nomyo-router
```
### Advanced Docker Configuration
#### Custom Port
```bash
docker run -d \
--name nomyo-router \
-p 9000:12434 \
-v /path/to/config:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
nomyo-router \
-- --port 9000
```
#### Custom Host
```bash
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /path/to/config:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
-e UVICORN_HOST=0.0.0.0 \
nomyo-router
```
#### Persistent Database
```bash
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /path/to/config:/app/config/ \
-v /path/to/db:/app/token_counts.db \
-e CONFIG_PATH=/app/config/config.yaml \
-e NOMYO_ROUTER_DB_PATH=/app/token_counts.db \
nomyo-router
```
### Docker Compose Example
See [examples/docker-compose.yml](examples/docker-compose.yml) for a complete Docker Compose example.
## 3. Kubernetes Deployment
### Prerequisites
- Kubernetes cluster
- kubectl configured
- Helm (optional)
### Basic Deployment
Create `nomyo-router-deployment.yaml`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nomyo-router
  labels:
    app: nomyo-router
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nomyo-router
  template:
    metadata:
      labels:
        app: nomyo-router
    spec:
      containers:
        - name: nomyo-router
          image: nomyo-router:latest
          ports:
            - containerPort: 12434
          env:
            - name: CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: NOMYO_ROUTER_DB_PATH
              value: "/app/token_counts.db"
          volumeMounts:
            - name: config-volume
              mountPath: /app/config
            - name: db-volume
              mountPath: /app/token_counts.db
              subPath: token_counts.db
      volumes:
        - name: config-volume
          configMap:
            name: nomyo-router-config
        - name: db-volume
          persistentVolumeClaim:
            claimName: nomyo-router-db-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: nomyo-router
spec:
  selector:
    app: nomyo-router
  ports:
    - protocol: TCP
      port: 80
      targetPort: 12434
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nomyo-router-config
data:
  config.yaml: |
    endpoints:
      - http://ollama-service:11434
    max_concurrent_connections: 2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nomyo-router-db-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```
Apply the deployment:
```bash
kubectl apply -f nomyo-router-deployment.yaml
```
### Horizontal Pod Autoscaler
Create `nomyo-router-hpa.yaml`:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nomyo-router-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nomyo-router
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Apply the HPA:
```bash
kubectl apply -f nomyo-router-hpa.yaml
```
## 4. Production Deployment
### High Availability Setup
For production environments with multiple Ollama instances:
```yaml
# config.yaml
endpoints:
- http://ollama-worker1:11434
- http://ollama-worker2:11434
- http://ollama-worker3:11434
- https://api.openai.com/v1
max_concurrent_connections: 4
api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
### Load Balancing
Deploy multiple router instances behind a load balancer:
```
              ┌─────────────────────────────────────────────┐
              │       Load Balancer (NGINX, Traefik)        │
              └─────────────────────────────────────────────┘
                    │              │              │
                    ▼              ▼              ▼
          ┌───────────────┐┌───────────────┐┌───────────────┐
          │ Router        ││ Router        ││ Router        │
          │ Instance 1    ││ Instance 2    ││ Instance 3    │
          └───────────────┘└───────────────┘└───────────────┘
                    │              │              │
                    ▼              ▼              ▼
              ┌─────────────────────────────────────────────┐
              │               Ollama Cluster                │
              │ ┌───────────┐ ┌───────────┐ ┌─────────────┐ │
              │ │  Ollama   │ │  Ollama   │ │ OpenAI API  │ │
              │ │ Worker 1  │ │ Worker 2  │ │ (Fallback)  │ │
              │ └───────────┘ └───────────┘ └─────────────┘ │
              └─────────────────────────────────────────────┘
```
### Monitoring and Logging
#### Prometheus Monitoring
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nomyo-router:12434']
```
#### Logging
Configure log aggregation:
```bash
# In Docker
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /path/to/config:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
--log-driver=fluentd \
--log-opt fluentd-address=fluentd:24224 \
nomyo-router
```
## Deployment Checklist
### Pre-Deployment
- [ ] Configure all Ollama endpoints
- [ ] Set appropriate `max_concurrent_connections`
- [ ] Configure API keys for remote endpoints
- [ ] Test configuration locally
- [ ] Set up monitoring and alerting
- [ ] Configure logging
- [ ] Set up backup for token counts database
### Post-Deployment
- [ ] Verify health endpoint: `curl http://<router>/health`
- [ ] Check endpoint status: `curl http://<router>/api/config`
- [ ] Monitor connection counts: `curl http://<router>/api/usage`
- [ ] Set up regular backups
- [ ] Configure auto-restart on failure
- [ ] Monitor performance metrics
## Scaling Guidelines
### Vertical Scaling
- Increase `max_concurrent_connections` for more parallel requests
- Add more CPU/memory to the router instance
- Monitor memory usage (token buffer grows with usage)
### Horizontal Scaling
- Deploy multiple router instances
- Use a load balancer to distribute traffic
- Each instance maintains its own connection tracking
- Database can be shared or per-instance
### Database Considerations
- SQLite is sufficient for single-instance deployments
- For multi-instance deployments, consider PostgreSQL
- Regular backups are recommended
- Database size grows with token usage history
## Security Best Practices
### Network Security
- Use TLS for all external connections
- Restrict access to router port (12434)
- Use firewall rules to limit access
- Consider using VPN for internal communications
### Configuration Security
- Store API keys in environment variables
- Restrict access to config.yaml
- Use secrets management for production deployments
- Rotate API keys regularly
### Runtime Security
- Run as non-root user
- Set appropriate file permissions
- Monitor for suspicious activity
- Keep dependencies updated
## Troubleshooting Deployment Issues
### Common Issues
**Problem**: Router not starting
- **Check**: Logs for configuration errors
- **Solution**: Validate config.yaml syntax
**Problem**: Endpoints showing as unavailable
- **Check**: Network connectivity from router to endpoints
- **Solution**: Verify firewall rules and DNS resolution
**Problem**: High latency
- **Check**: Endpoint health and connection counts
- **Solution**: Add more endpoints or increase concurrency limits
**Problem**: Database errors
- **Check**: Database file permissions
- **Solution**: Ensure write permissions for the database path
**Problem**: Connection limits being hit
- **Check**: `/api/usage` endpoint
- **Solution**: Increase `max_concurrent_connections` or add endpoints
## Examples
See the [examples](examples/) directory for ready-to-use deployment examples.

doc/examples/docker-compose.yml Normal file

@@ -0,0 +1,98 @@
# Docker Compose example for NOMYO Router with multiple Ollama instances
version: '3.8'

services:
  # NOMYO Router
  nomyo-router:
    image: nomyo-router:latest
    build: .
    ports:
      - "12434:12434"
    environment:
      - CONFIG_PATH=/app/config/config.yaml
      - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
    volumes:
      - ./config:/app/config
      # Named volumes mount as directories, so keep the SQLite file inside one
      - router-db:/app/data
    depends_on:
      - ollama1
      - ollama2
      - ollama3
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 1
  ollama1:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama1-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 2
  ollama2:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
    volumes:
      - ollama2-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 3
  ollama3:
    image: ollama/ollama:latest
    ports:
      - "11436:11434"
    volumes:
      - ollama3-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Optional: Prometheus for monitoring
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped
    networks:
      - nomyo-net

  # Optional: Grafana for visualization
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    restart: unless-stopped
    networks:
      - nomyo-net

volumes:
  router-db:
  ollama1-data:
  ollama2-data:
  ollama3-data:
  grafana-storage:

networks:
  nomyo-net:
    driver: bridge

doc/examples/sample-config.yaml Normal file

@@ -0,0 +1,37 @@
# Sample NOMYO Router Configuration
# Basic single endpoint configuration
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
# Multi-endpoint configuration with local Ollama instances
# endpoints:
# - http://ollama-worker1:11434
# - http://ollama-worker2:11434
# - http://ollama-worker3:11434
# Mixed configuration with Ollama and OpenAI endpoints
# endpoints:
# - http://localhost:11434
# - https://api.openai.com/v1
# API keys for remote endpoints
# Use ${VAR_NAME} syntax to reference environment variables
api_keys:
  # Local Ollama instances typically don't require authentication
  "http://localhost:11434": "ollama"
  # Remote Ollama instances
  # "http://remote-ollama:11434": "ollama"
  # OpenAI API
  # "https://api.openai.com/v1": "${OPENAI_KEY}"
  # Anthropic API
  # "https://api.anthropic.com/v1": "${ANTHROPIC_KEY}"
  # Other OpenAI-compatible endpoints
  # "https://api.mistral.ai/v1": "${MISTRAL_KEY}"

515
doc/monitoring.md Normal file

@@ -0,0 +1,515 @@
# Monitoring and Troubleshooting Guide
## Monitoring Overview
NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
## Monitoring Endpoints
### Health Check
```bash
curl http://localhost:12434/health
```
Response:
```json
{
"status": "ok" | "error",
"endpoints": {
"http://endpoint1:11434": {
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
}
}
```
**HTTP Status Codes**:
- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy
### Current Usage
```bash
curl http://localhost:12434/api/usage
```
Response:
```json
{
"usage_counts": {
"http://endpoint1:11434": {
"llama3": 2,
"mistral": 1
},
"http://endpoint2:11434": {
"llama3": 0,
"mistral": 3
}
},
"token_usage_counts": {
"http://endpoint1:11434": {
"llama3": 1542,
"mistral": 876
}
}
}
```
### Token Statistics
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
"total_tokens": 2418,
"breakdown": [
{
"endpoint": "http://endpoint1:11434",
"model": "llama3",
"input_tokens": 120,
"output_tokens": 1422,
"total_tokens": 1542
},
{
"endpoint": "http://endpoint1:11434",
"model": "mistral",
"input_tokens": 80,
"output_tokens": 796,
"total_tokens": 876
}
]
}
```
### Model Statistics
```bash
curl http://localhost:12434/api/stats -X POST -H "Content-Type: application/json" -d '{"model": "llama3"}'
```
Response:
```json
{
"model": "llama3",
"input_tokens": 120,
"output_tokens": 1422,
"total_tokens": 1542,
"time_series": [
{
"endpoint": "http://endpoint1:11434",
"timestamp": 1712345600,
"input_tokens": 20,
"output_tokens": 150,
"total_tokens": 170
}
],
"endpoint_distribution": {
"http://endpoint1:11434": 1542
}
}
```
### Configuration Status
```bash
curl http://localhost:12434/api/config
```
Response:
```json
{
"endpoints": [
{
"url": "http://endpoint1:11434",
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
]
}
```
### Real-time Usage Stream
```bash
curl http://localhost:12434/api/usage-stream
```
This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}
data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
## Monitoring Tools
### Prometheus Integration
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```
### Grafana Dashboard
Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates
### Logging
The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions
## Troubleshooting Guide
### Common Issues and Solutions
#### 1. Endpoint Unavailable
**Symptoms**:
- Health check shows endpoint as "error"
- Requests fail with connection errors
**Diagnosis**:
```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```
**Solutions**:
- Verify Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test direct connection: `curl http://endpoint:11434/api/version`
#### 2. Model Not Found
**Symptoms**:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found
**Diagnosis**:
```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```
**Solutions**:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if model is available on any endpoint
- For OpenAI endpoints, ensure model exists in their catalog
#### 3. High Latency
**Symptoms**:
- Slow response times
- Requests timing out
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```
**Solutions**:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints
#### 4. Connection Limits Reached
**Symptoms**:
- Requests queuing
- High connection counts
- Slow response times
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
```
**Solutions**:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use MOE system for critical queries
#### 5. Token Tracking Not Working
**Symptoms**:
- Token counts not updating
- Database errors
**Diagnosis**:
```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```
**Solutions**:
- Verify database file permissions
- Check if database path is writable
- Restart router to rebuild database
- Check disk space
- Verify environment variable `NOMYO_ROUTER_DB_PATH`
#### 6. Streaming Issues
**Symptoms**:
- Incomplete responses
- Connection resets during streaming
- Timeout errors
**Diagnosis**:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts
**Solutions**:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads
### Error Messages
#### "Failed to connect to endpoint"
**Cause**: Network connectivity issue
**Action**: Verify endpoint is reachable, check firewall, test DNS
#### "None of the configured endpoints advertise the model"
**Cause**: Model not pulled on any endpoint
**Action**: Pull the model, verify model name
#### "Timed out waiting for endpoint"
**Cause**: Endpoint slow to respond
**Action**: Check endpoint health, increase timeouts
#### "Invalid JSON format in request body"
**Cause**: Malformed request
**Action**: Validate request payload, check API documentation
#### "Missing required field 'model'"
**Cause**: Request missing model parameter
**Action**: Add model parameter to request
### Performance Tuning
#### Optimizing Connection Handling
1. **Adjust concurrency limits**:
```yaml
max_concurrent_connections: 4
```
2. **Monitor connection usage**:
```bash
curl http://localhost:12434/api/usage
```
3. **Scale horizontally**:
- Add more Ollama endpoints
- Deploy multiple router instances
#### Reducing Latency
1. **Keep models loaded**:
- Use frequently accessed models
- Monitor `/api/ps` for loaded models
2. **Optimize endpoint selection**:
- Distribute models across endpoints
- Balance load evenly
3. **Use caching**:
- Models cache (300s TTL)
- Loaded models cache (30s TTL)
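Those two TTLs can be pictured as a timestamped dictionary: serve the cached value while it is fresh, refetch once it expires. A minimal sketch (not the router's actual cache implementation):

```python
import time

class TTLCache:
    """Cache entries for a fixed number of seconds, then refetch."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}  # key -> (expiry_time, value)

    def get(self, key, fetch):
        """Return the cached value, calling fetch() when missing or expired."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[0] > now:
            return entry[1]  # still fresh
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value

models_cache = TTLCache(ttl_seconds=300)  # models list, per the TTL above
loaded_cache = TTLCache(ttl_seconds=30)   # loaded models, shorter TTL
```

The shorter TTL on loaded models trades a little extra polling for a more accurate picture of which endpoints already have a model in memory.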
#### Memory Management
1. **Monitor memory usage**:
- Token buffer grows with usage
- Time-series data accumulates
2. **Adjust flush interval**:
- Default: 10 seconds
- Can be increased for less frequent I/O
3. **Database maintenance**:
- Regular backups
- Archive old data periodically
### Database Management
#### Viewing Token Data
```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```
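Building on the queries above, per-model totals can be aggregated directly in SQL. The schema used below (endpoint, model, input/output token columns) is an assumption for illustration; inspect your `token_counts.db` with `.schema` for the real one:

```python
import sqlite3

# Build a small in-memory stand-in for token_counts.db (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts ("
    " endpoint TEXT, model TEXT,"
    " input_tokens INTEGER, output_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO token_counts VALUES (?, ?, ?, ?)",
    [
        ("http://endpoint1:11434", "llama3", 120, 1422),
        ("http://endpoint1:11434", "mistral", 80, 796),
    ],
)

# Aggregate total tokens per model, mirroring the breakdown the API reports.
rows = conn.execute(
    "SELECT model, SUM(input_tokens + output_tokens) AS total"
    " FROM token_counts GROUP BY model ORDER BY total DESC"
).fetchall()
print(rows)  # [('llama3', 1542), ('mistral', 876)]
```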
#### Aggregating Old Data
```bash
curl http://localhost:12434/api/aggregate_time_series_days \
-X POST \
-d '{"days": 30, "trim_old": true}'
```
#### Backing Up Database
```bash
cp token_counts.db token_counts.db.backup
```
#### Restoring Database
```bash
cp token_counts.db.backup token_counts.db
```
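Copying the file while the router is writing can capture a mid-transaction state. SQLite's online backup API, exposed through Python's `sqlite3` module, produces a consistent copy even from a live database; a sketch:

```python
import sqlite3

def backup_database(src_path: str, dest_path: str) -> None:
    """Create a consistent copy of a (possibly live) SQLite database."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # online backup: safe against concurrent writers
    src.close()
    dest.close()

# backup_database("token_counts.db", "token_counts.db.backup")
```

The plain `cp` above is fine when the router is stopped; prefer this approach for backups taken while it is running.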
### Advanced Troubleshooting
#### Debugging Endpoint Selection
Enable debug logging to see endpoint selection decisions:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```
#### Testing Individual Endpoints
```bash
# Test endpoint directly
curl http://endpoint:11434/api/version
# Test model availability
curl http://endpoint:11434/api/tags
# Test model loading
curl http://endpoint:11434/api/ps
```
#### Network Diagnostics
```bash
# Test connectivity
nc -zv endpoint 11434
# Test DNS resolution
dig endpoint
# Test latency
ping endpoint
```
### Common Pitfalls
1. **Using localhost in Docker**:
- Inside Docker, `localhost` refers to the container itself
- Use `host.docker.internal` or Docker service names
2. **Incorrect model names**:
- Ollama: `llama3:latest`
- OpenAI: `gpt-4` (no version suffix)
3. **Missing API keys**:
- Remote endpoints require authentication
- Set keys in config.yaml or environment variables
4. **Firewall blocking**:
- Ensure port 11434 is open for Ollama
- Ensure port 12434 is open for router
5. **Insufficient resources**:
- Monitor CPU/memory on Ollama endpoints
- Adjust `max_concurrent_connections` accordingly
## Best Practices
### Monitoring Setup
1. **Set up health checks**:
- Monitor `/health` endpoint
- Alert on status "error"
2. **Track usage metrics**:
- Monitor connection counts
- Track token usage
- Watch for connection limits
3. **Log important events**:
- Configuration changes
- Endpoint failures
- Recovery events
4. **Regular backups**:
- Backup token_counts.db
- Schedule regular backups
- Test restore procedure
### Performance Monitoring
1. **Baseline metrics**:
- Establish normal usage patterns
- Track trends over time
2. **Alert thresholds**:
- Set alerts for high connection counts
- Monitor error rates
- Watch for latency spikes
3. **Capacity planning**:
- Track growth in token usage
- Plan for scaling needs
- Monitor resource utilization
### Incident Response
1. **Quick diagnosis**:
- Check health endpoint first
- Review recent logs
- Verify endpoint status
2. **Isolation**:
- Identify affected endpoints
- Isolate problematic components
- Fall back to healthy endpoints
3. **Recovery**:
- Restart router if needed
- Rebalance load
- Restore from backup if necessary
## Examples
See the [examples](examples/) directory for monitoring configuration examples.

doc/usage.md Normal file
@ -0,0 +1,348 @@
# Usage Guide
## Quick Start
### 1. Install the Router
```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
### 2. Configure Endpoints
Edit `config.yaml`:
```yaml
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
```
### 3. Run the Router
```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```
### 4. Use the Router
Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.
## API Endpoints
### Ollama-Compatible Endpoints
The router provides all standard Ollama API endpoints:
| Endpoint | Method | Description |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
| `/api/tags` | GET | List available models |
| `/api/ps` | GET | List loaded models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Pull a model |
| `/api/push` | POST | Push a model |
| `/api/create` | POST | Create a model |
| `/api/copy` | POST | Copy a model |
| `/api/delete` | DELETE | Delete a model |
### OpenAI-Compatible Endpoints
For OpenAI API compatibility:
| Endpoint | Method | Description |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
| `/v1/models` | GET | List models |
### Monitoring Endpoints
| Endpoint | Method | Description |
| ---------------------------------- | ------ | ---------------------------------------- |
| `/api/usage` | GET | Current connection counts |
| `/api/token_counts` | GET | Token usage statistics |
| `/api/stats` | POST | Detailed model statistics |
| `/api/aggregate_time_series_days` | POST | Aggregate time-series data into daily buckets |
| `/api/version` | GET | Ollama version info |
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
## Making Requests
### Basic Chat Request
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:latest",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"stream": false
}'
```
### Streaming Response
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:latest",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": true
}'
```
### OpenAI API Format
```bash
curl http://localhost:12434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-nano",
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
## Advanced Features
### Multiple Opinions Ensemble (MOE)
Prefix your model name with `moe-` to enable the MOE system:
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "moe-llama3:latest",
"messages": [
{"role": "user", "content": "Explain quantum computing"}
]
}'
```
The MOE system:
1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response
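The four steps above can be sketched client-side. This is an illustration of the ensemble idea, not the router's internal implementation; `generate`, `critique`, `select`, and `refine` are hypothetical callables supplied by the caller:

```python
def moe_answer(prompt, generate, critique, select, refine, n=3):
    """Multiple Opinions Ensemble, following the four steps outlined above."""
    drafts = [generate(prompt, i) for i in range(n)]   # 1. n draft responses
    reviews = [critique(prompt, d) for d in drafts]    # 2. critique each draft
    best = select(drafts, reviews)                     # 3. pick the best draft
    return refine(prompt, best, reviews)               # 4. final refined answer

# With stub callables, the pipeline wires together like this:
answer = moe_answer(
    "Explain quantum computing",
    generate=lambda p, i: f"draft {i} for: {p}",
    critique=lambda p, d: len(d),          # toy critique: prefer longer drafts
    select=lambda ds, rs: ds[rs.index(max(rs))],
    refine=lambda p, b, rs: f"refined: {b}",
)
print(answer)
```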
### Token Tracking
The router automatically tracks token usage:
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
"total_tokens": 1542,
"breakdown": [
{
"endpoint": "http://localhost:11434",
"model": "llama3",
"input_tokens": 120,
"output_tokens": 1422,
"total_tokens": 1542
}
]
}
```
### Real-time Monitoring
Use Server-Sent Events to monitor usage in real-time:
```bash
curl http://localhost:12434/api/usage-stream
```
## Integration Examples
### Python Client
```python
import requests
url = "http://localhost:12434/api/chat"
data = {
"model": "llama3",
"messages": [{"role": "user", "content": "What is AI?"}],
"stream": False
}
response = requests.post(url, json=data)
print(response.json())
```
### JavaScript Client
```javascript
const response = await fetch('http://localhost:12434/api/chat', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'llama3',
messages: [{ role: 'user', content: 'Hello!' }],
stream: false
})
});
const data = await response.json();
console.log(data);
```
### Streaming with JavaScript
```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');
eventSource.onmessage = (event) => {
const usage = JSON.parse(event.data);
console.log('Current usage:', usage);
};
```
### Python Ollama Client
```python
from ollama import Client
# Configure the client to use the router
client = Client(host='http://localhost:12434')
# Chat with a model
response = client.chat(
model='llama3:latest',
messages=[
{'role': 'user', 'content': 'Explain quantum computing'}
]
)
print(response['message']['content'])
# Generate text
response = client.generate(
model='llama3:latest',
prompt='Write a short poem about AI'
)
print(response['response'])
# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```
### Python OpenAI Client
```python
from openai import OpenAI
# Configure the client to use the router
client = OpenAI(
base_url='http://localhost:12434/v1',
api_key='not-needed' # API key is not required for local usage
)
# Chat completions
response = client.chat.completions.create(
model='gpt-4o-nano',
messages=[
{'role': 'user', 'content': 'What is the meaning of life?'}
]
)
print(response.choices[0].message.content)
# Text completions
response = client.completions.create(
model='llama3:latest',
prompt='Once upon a time'
)
print(response.choices[0].text)
# Embeddings
response = client.embeddings.create(
model='llama3:latest',
input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")
# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```
## Best Practices
### 1. Model Selection
- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without version
### 2. Connection Management
- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries
### 3. Error Handling
- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
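The retry advice above can be condensed into a small helper with exponentially growing delays (a sketch; tune the attempt count and base delay to your deployment, and note the `requests` call in the comment is a hypothetical usage):

```python
import time

def with_retries(call, attempts=3, base_delay=0.5):
    """Call `call()`, retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Example (hypothetical): with_retries(lambda: requests.post(url, json=data).json())
```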
### 4. Performance
- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs
## Troubleshooting
### Common Issues
**Problem**: Model not found
- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`
**Problem**: Connection refused
- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`
**Problem**: High latency
- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`
**Problem**: Token tracking not working
- **Solution**: Ensure database path is writable
- **Check**: `ls -la token_counts.db`
## Examples
See the [examples](examples/) directory for complete integration examples.

router.py

@ -9,6 +9,9 @@ license: AGPL
import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance
from datetime import datetime, timezone
from pathlib import Path
# Directory containing static files (relative to this script)
STATIC_DIR = Path(__file__).parent / "static"
from typing import Dict, Set, List, Optional
from urllib.parse import urlparse
from fastapi import FastAPI, Request, HTTPException
@ -55,6 +58,8 @@ flush_task: asyncio.Task | None = None
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
# Time series buffer with timestamp
time_series_buffer: list[dict[str, int | str]] = []
# Lock to protect buffer access from race conditions
buffer_lock = asyncio.Lock()
# Configuration for periodic flushing
FLUSH_INTERVAL = 10 # seconds
@ -218,45 +223,112 @@ def is_ext_openai_endpoint(endpoint: str) -> bool:
return True # It's an external OpenAI endpoint
async def token_worker() -> None:
try:
while True:
endpoint, model, prompt, comp = await token_queue.get()
# Accumulate counts in memory buffer (protected by lock)
async with buffer_lock:
token_buffer[endpoint][model] = (
token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
token_buffer[endpoint].get(model, (0, 0))[1] + comp
)
# Add to time series buffer with timestamp (UTC)
now = datetime.now(tz=timezone.utc)
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
time_series_buffer.append({
'endpoint': endpoint,
'model': model,
'input_tokens': prompt,
'output_tokens': comp,
'total_tokens': prompt + comp,
'timestamp': timestamp
})
# Update in-memory counts for immediate reporting
async with token_usage_lock:
token_usage_counts[endpoint][model] += (prompt + comp)
await publish_snapshot()
except asyncio.CancelledError:
# Gracefully handle task cancellation during shutdown
print("[token_worker] Task cancelled, processing remaining queue items...")
# Process any remaining items in the queue before exiting
while not token_queue.empty():
try:
endpoint, model, prompt, comp = token_queue.get_nowait()
async with buffer_lock:
token_buffer[endpoint][model] = (
token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
token_buffer[endpoint].get(model, (0, 0))[1] + comp
)
now = datetime.now(tz=timezone.utc)
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
time_series_buffer.append({
'endpoint': endpoint,
'model': model,
'input_tokens': prompt,
'output_tokens': comp,
'total_tokens': prompt + comp,
'timestamp': timestamp
})
async with token_usage_lock:
token_usage_counts[endpoint][model] += (prompt + comp)
await publish_snapshot()
except asyncio.QueueEmpty:
break
print("[token_worker] Task cancelled, remaining items processed.")
raise
async def flush_buffer() -> None:
"""Periodically flush accumulated token counts to the database."""
try:
while True:
await asyncio.sleep(FLUSH_INTERVAL)
# Flush token counts and time series (protected by lock)
async with buffer_lock:
if token_buffer:
# Copy buffer before releasing lock for DB operation
buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
token_buffer.clear()
else:
buffer_copy = None
if time_series_buffer:
ts_copy = list(time_series_buffer)
time_series_buffer.clear()
else:
ts_copy = None
# Perform DB operations outside the lock to avoid blocking
if buffer_copy:
await db.update_batched_counts(buffer_copy)
if ts_copy:
await db.add_batched_time_series(ts_copy)
except asyncio.CancelledError:
# Gracefully handle task cancellation during shutdown
print("[flush_buffer] Task cancelled, flushing remaining buffers...")
# Flush any remaining data before exiting
try:
async with buffer_lock:
if token_buffer:
buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
token_buffer.clear()
else:
buffer_copy = None
if time_series_buffer:
ts_copy = list(time_series_buffer)
time_series_buffer.clear()
else:
ts_copy = None
if buffer_copy:
await db.update_batched_counts(buffer_copy)
if ts_copy:
await db.add_batched_time_series(ts_copy)
print("[flush_buffer] Task cancelled, remaining buffers flushed.")
except Exception as e:
print(f"[flush_buffer] Error during shutdown flush: {e}")
raise
async def flush_remaining_buffers() -> None:
"""
@ -265,14 +337,24 @@ async def flush_remaining_buffers() -> None:
"""
try:
flushed_entries = 0
async with buffer_lock:
if token_buffer:
buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
flushed_entries += sum(len(v) for v in token_buffer.values())
token_buffer.clear()
else:
buffer_copy = None
if time_series_buffer:
ts_copy = list(time_series_buffer)
flushed_entries += len(time_series_buffer)
time_series_buffer.clear()
else:
ts_copy = None
# Perform DB operations outside the lock
if buffer_copy:
await db.update_batched_counts(buffer_copy)
if ts_copy:
await db.add_batched_time_series(ts_copy)
if flushed_entries:
print(f"[shutdown] Flushed {flushed_entries} in-memory entries to DB on shutdown.")
else:
@ -884,7 +966,9 @@ async def choose_endpoint(model: str) -> str:
]
if endpoints_with_free_slot:
#return random.choice(endpoints_with_free_slot)
endpoints_with_free_slot.sort(key=lambda ep: sum(usage_counts.get(ep, {}).values()))
return endpoints_with_free_slot[0]
# 5⃣ All candidate endpoints are saturated: pick the one with the lowest usage count (will queue)
ep = min(candidate_endpoints, key=current_usage)
@ -2053,7 +2137,13 @@ async def index(request: Request):
Render the dynamic NOMYO Router dashboard listing the configured endpoints
and the models details, availability & task status.
"""
index_path = STATIC_DIR / "index.html"
try:
return HTMLResponse(content=index_path.read_text(encoding="utf-8"), status_code=200)
except FileNotFoundError:
raise HTTPException(status_code=404, detail="Page not found")
except Exception:
raise HTTPException(status_code=500, detail="Internal server error")
# -------------------------------------------------------------
# 26. Health endpoint