Merge pull request #19 from nomyo-ai/dev-v0.5.x

feat:
buffer_locks preventing race conditions in high concurrency scenarios
documentation folder
Alpha Nerd 2026-01-06 10:51:29 +01:00 committed by GitHub
commit 6828411f95
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
9 changed files with 2167 additions and 42 deletions

137
doc/README.md Normal file

@@ -0,0 +1,137 @@
# NOMYO Router Documentation
Welcome to the NOMYO Router documentation! This folder contains comprehensive guides for using, configuring, and deploying the NOMYO Router.
## Documentation Structure
```
doc/
├── architecture.md # Technical architecture overview
├── configuration.md # Detailed configuration guide
├── usage.md # API usage examples
├── deployment.md # Deployment scenarios
├── monitoring.md # Monitoring and troubleshooting
└── examples/ # Example configurations and scripts
├── docker-compose.yml
├── sample-config.yaml
└── k8s-deployment.yaml
```
## Getting Started
### Quick Start Guide
1. **Install the router**:
```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
2. **Configure endpoints** in `config.yaml`:
```yaml
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
```
3. **Run the router**:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```
4. **Use the router**: Point your frontend to `http://localhost:12434` instead of your Ollama instance.
### Key Features
- **Intelligent Routing**: Model deployment-aware routing with load balancing
- **Multi-Endpoint Support**: Combine Ollama and OpenAI-compatible endpoints
- **Token Tracking**: Comprehensive token usage monitoring
- **Real-time Monitoring**: Server-Sent Events for live usage updates
- **OpenAI Compatibility**: Full OpenAI API compatibility layer
- **MOE System**: Multiple Opinions Ensemble for improved responses with smaller models
## Documentation Guides
### [Architecture](architecture.md)
Learn about the router's internal architecture, routing algorithm, caching mechanisms, and advanced features like the MOE system.
### [Configuration](configuration.md)
Detailed guide on configuring the router with multiple endpoints, API keys, and environment variables.
### [Usage](usage.md)
Comprehensive API reference with examples for making requests, streaming responses, and using advanced features.
### [Deployment](deployment.md)
Step-by-step deployment guides for bare metal, Docker, Kubernetes, and production environments.
### [Monitoring](monitoring.md)
Monitoring endpoints, troubleshooting guides, performance tuning, and best practices for maintaining your router.
## Examples
The [examples](examples/) directory contains ready-to-use configuration files:
- **docker-compose.yml**: Complete Docker Compose setup with multiple Ollama instances
- **sample-config.yaml**: Example configuration with comments
- **k8s-deployment.yaml**: Kubernetes deployment manifests
## Need Help?
### Common Issues
Check the [Monitoring Guide](monitoring.md) for troubleshooting common problems:
- Endpoint unavailable
- Model not found
- High latency
- Connection limits reached
- Token tracking issues
### Support
For additional help:
1. Check the [GitHub Issues](https://github.com/nomyo-ai/nomyo-router/issues)
2. Review the [Monitoring Guide](monitoring.md) for diagnostics
3. Examine the router logs for detailed error messages
## Best Practices
### Configuration
- Use environment variables for API keys
- Set appropriate `max_concurrent_connections` based on your hardware
- Monitor endpoint health regularly
- Keep models loaded on multiple endpoints for redundancy
### Deployment
- Use Docker for containerized deployments
- Consider Kubernetes for production at scale
- Set up monitoring and alerting
- Implement regular backups of the token counts database
### Performance
- Balance load across multiple endpoints
- Keep frequently used models loaded
- Monitor connection counts and token usage
- Scale horizontally when needed
## Next Steps
1. **Read the [Architecture Guide](architecture.md)** to understand how the router works
2. **Configure your endpoints** in `config.yaml`
3. **Deploy the router** using your preferred method
4. **Monitor your setup** using the monitoring endpoints
5. **Scale as needed** by adding more endpoints
Happy routing! 🚀

259
doc/architecture.md Normal file

@@ -0,0 +1,259 @@
# NOMYO Router Architecture
## Overview
NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and Ollama backend(s), providing intelligent request routing based on model availability and load balancing.
## Core Components
### 1. Request Routing Engine
The router's core intelligence is in the `choose_endpoint()` function, which implements a sophisticated routing algorithm:
```python
async def choose_endpoint(model: str) -> str:
"""
Endpoint selection algorithm:
1. Query all endpoints for advertised models
2. Filter endpoints that advertise the requested model
3. Among candidates, find those with the model loaded AND free slots
4. If none loaded with free slots, pick any with free slots
5. If all saturated, pick endpoint with lowest current usage
6. If no endpoint advertises the model, raise error
"""
```
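The steps above can be sketched in plain Python. This is a simplified, synchronous illustration, not the router's actual implementation: `MAX_CONN` and the dictionaries passed in are assumptions standing in for the async endpoint queries and the `max_concurrent_connections` setting.

```python
MAX_CONN = 2  # stands in for max_concurrent_connections

def choose_endpoint(model, advertised, loaded, usage):
    """advertised/loaded: endpoint -> set of model names; usage: endpoint -> active connections."""
    candidates = [ep for ep, models in advertised.items() if model in models]
    if not candidates:
        raise ValueError(f"No configured endpoint advertises model {model!r}")
    # 3. Prefer endpoints where the model is already loaded AND a slot is free
    warm = [ep for ep in candidates
            if model in loaded.get(ep, set()) and usage.get(ep, 0) < MAX_CONN]
    if warm:
        return warm[0]
    # 4. Otherwise, any candidate with a free slot (loading the model there)
    free = [ep for ep in candidates if usage.get(ep, 0) < MAX_CONN]
    if free:
        return free[0]
    # 5. All saturated: pick the candidate with the lowest current usage
    return min(candidates, key=lambda ep: usage.get(ep, 0))
```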
### 2. Connection Tracking
The router maintains real-time connection counts per endpoint-model pair:
```python
from collections import defaultdict
from typing import Dict

# Live connection count per endpoint URL, per model name
usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
```
This allows for:
- **Load-aware routing**: Requests are routed to endpoints with available capacity
- **Model-aware routing**: Requests are routed to endpoints where the model is already loaded
- **Efficient resource utilization**: Minimizes model loading/unloading operations
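As an illustrative sketch (the `acquire`/`release` helpers are hypothetical names, not the router's API), the counter is incremented when a request starts and decremented when it completes, with zero counts cleaned up:

```python
from collections import defaultdict

usage_counts = defaultdict(lambda: defaultdict(int))

def acquire(endpoint: str, model: str) -> None:
    # Called when a request is routed to this endpoint-model pair
    usage_counts[endpoint][model] += 1

def release(endpoint: str, model: str) -> None:
    # Called when the response completes; zero counts are removed
    usage_counts[endpoint][model] -= 1
    if usage_counts[endpoint][model] <= 0:
        del usage_counts[endpoint][model]
```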
### 3. Caching Layer
Three types of caches improve performance:
- **Models cache** (`_models_cache`): Caches available models per endpoint (300s TTL)
- **Loaded models cache** (`_loaded_models_cache`): Caches currently loaded models (30s TTL)
- **Error cache** (`_error_cache`): Caches transient errors (10s TTL)
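A minimal sketch of the TTL pattern these caches share (the class name and structure are illustrative; the router's caches differ in detail):

```python
import time

class TTLCache:
    """Entries expire ttl seconds after being set."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # missing or expired
        return entry[1]

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

# e.g. models_cache = TTLCache(300), loaded_cache = TTLCache(30), error_cache = TTLCache(10)
```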
### 4. Token Tracking System
Comprehensive token usage tracking:
```python
from collections import defaultdict

# Buffered (input_tokens, output_tokens) per endpoint, per model
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
# Time-series records awaiting the next periodic flush
time_series_buffer: list[dict[str, int | str]] = []
```
Features:
- Real-time token counting for input/output tokens
- Periodic flushing to SQLite database (every 10 seconds)
- Time-series data for historical analysis
- Per-endpoint, per-model breakdown
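The buffer-then-flush cycle can be sketched as follows. This is an illustration of the write-behind pattern against the documented `token_counts` schema; `record` and `flush` are hypothetical names, and the real router flushes on a 10-second timer rather than on demand.

```python
import sqlite3
from collections import defaultdict

token_buffer = defaultdict(lambda: defaultdict(lambda: (0, 0)))

def record(endpoint: str, model: str, n_in: int, n_out: int) -> None:
    # Cheap in-memory accumulation on the request path
    i, o = token_buffer[endpoint][model]
    token_buffer[endpoint][model] = (i + n_in, o + n_out)

def flush(conn: sqlite3.Connection) -> None:
    # Periodic write-behind: merge buffered counts into SQLite, then clear
    for endpoint, models in token_buffer.items():
        for model, (i, o) in models.items():
            conn.execute(
                """INSERT INTO token_counts (endpoint, model, input_tokens, output_tokens, total_tokens)
                   VALUES (?, ?, ?, ?, ?)
                   ON CONFLICT(endpoint, model) DO UPDATE SET
                     input_tokens = input_tokens + excluded.input_tokens,
                     output_tokens = output_tokens + excluded.output_tokens,
                     total_tokens = total_tokens + excluded.total_tokens""",
                (endpoint, model, i, o, i + o),
            )
    token_buffer.clear()
    conn.commit()
```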
### 5. API Compatibility Layer
The router supports multiple API formats:
- **Ollama API**: Native `/api/generate`, `/api/chat`, `/api/embed` endpoints
- **OpenAI API**: Compatible `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings` endpoints
- **Transparent conversion**: Responses are converted between formats as needed
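To show the general shape of such a conversion, here is a sketch mapping an Ollama `/api/chat` response onto the OpenAI `/v1/chat/completions` shape. The field mapping is an assumption for illustration, not the router's exact conversion code.

```python
import time
import uuid

def ollama_chat_to_openai(resp: dict, model: str) -> dict:
    """Illustrative conversion: Ollama chat response -> OpenAI chat completion."""
    prompt_toks = resp.get("prompt_eval_count", 0)
    completion_toks = resp.get("eval_count", 0)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": resp.get("message", {}),
            "finish_reason": "stop" if resp.get("done") else None,
        }],
        "usage": {
            "prompt_tokens": prompt_toks,
            "completion_tokens": completion_toks,
            "total_tokens": prompt_toks + completion_toks,
        },
    }
```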
## Data Flow
### Request Processing
1. **Ingress**: Frontend sends request to router
2. **Endpoint Selection**: Router determines optimal endpoint
3. **Request Forwarding**: Request sent to selected Ollama endpoint
4. **Response Streaming**: Response streamed back to frontend
5. **Usage Tracking**: Connection and token counts updated
6. **Egress**: Complete response returned to frontend
### Connection Management
```mermaid
sequenceDiagram
    participant Frontend
    participant Router
    participant Endpoint1
    participant Endpoint2

    Frontend->>Router: Request for model X
    Router->>Endpoint1: Check if model X is loaded
    Router->>Endpoint2: Check if model X is loaded
    alt Endpoint1 has model X loaded
        Router->>Endpoint1: Forward request
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    else Endpoint2 has model X loaded
        Router->>Endpoint2: Forward request
        Endpoint2->>Router: Stream response
        Router->>Frontend: Stream response
    else No endpoint has model X loaded
        Router->>Endpoint1: Forward request (will trigger load)
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    end
```
## Advanced Features
### Multiple Opinions Ensemble (MOE)
When the user prefixes a model name with `moe-`, the router activates the MOE system:
1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response based on critiques
4. Generates final refined response
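The four steps above can be expressed as a small skeleton. The `generate`, `critique`, and `refine` callables are stand-ins for requests to different endpoints; this is a sketch of the flow, not the router's MOE code.

```python
def moe(prompt, generate, critique, refine, n=3):
    """Multiple Opinions Ensemble skeleton: draft n responses, score them
    via critiques, refine the best-scored draft."""
    drafts = [generate(prompt, i) for i in range(n)]          # step 1
    scores = [critique(prompt, d) for d in drafts]            # step 2
    best = drafts[max(range(n), key=lambda i: scores[i])]     # step 3
    return refine(prompt, best)                               # step 4
```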
### OpenAI Endpoint Support
The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:
- Detects OpenAI endpoints (those containing `/v1`)
- Converts between Ollama and OpenAI response formats
- Handles authentication with API keys
- Maintains consistent behavior across endpoint types
## Performance Considerations
### Concurrency Model
- **Max concurrent connections**: Configurable per endpoint-model pair
- **Connection pooling**: Reuses aiohttp connections
- **Async I/O**: All operations are non-blocking
- **Backpressure handling**: Queues requests when endpoints are saturated
### Caching Strategy
- **Short TTL for loaded models** (30s): Ensures quick detection of model loading/unloading
- **Longer TTL for available models** (300s): Reduces unnecessary API calls
- **Error caching** (10s): Prevents thundering herd during outages
### Memory Management
- **Write-behind pattern**: Token counts buffered in memory, flushed periodically
- **Queue-based SSE**: Server-Sent Events use bounded queues to prevent memory bloat
- **Automatic cleanup**: Zero connection counts are removed from tracking
## Error Handling
### Transient Errors
- Temporary connection failures are cached for 10 seconds
- During cache period, endpoint is treated as unavailable
- After cache expires, endpoint is re-tested
### Permanent Errors
- Invalid model names result in clear error messages
- Missing required fields return 400 Bad Request
- Unreachable endpoints are reported with detailed connection issues
### Health Monitoring
The `/health` endpoint provides comprehensive health status:
```json
{
"status": "ok" | "error",
"endpoints": {
"http://endpoint1:11434": {
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
}
}
```
## Database Schema
The router uses SQLite for persistent storage:
```sql
CREATE TABLE token_counts (
endpoint TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
total_tokens INTEGER NOT NULL,
PRIMARY KEY (endpoint, model)
);
CREATE TABLE time_series (
endpoint TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
total_tokens INTEGER NOT NULL,
timestamp INTEGER NOT NULL,
PRIMARY KEY (endpoint, model, timestamp)
);
```
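As a quick sketch, the `token_counts` table can be created and queried for aggregate totals (Python's built-in `sqlite3`, using the example numbers from the monitoring endpoints):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
)""")
conn.execute("INSERT INTO token_counts VALUES (?, ?, ?, ?, ?)",
             ("http://endpoint1:11434", "llama3", 120, 1422, 1542))
conn.execute("INSERT INTO token_counts VALUES (?, ?, ?, ?, ?)",
             ("http://endpoint1:11434", "mistral", 80, 796, 876))

# Aggregate total across all endpoint-model pairs
total = conn.execute("SELECT SUM(total_tokens) FROM token_counts").fetchone()[0]
```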
## Scaling Considerations
### Horizontal Scaling
- Multiple router instances can run behind a load balancer
- Each instance maintains its own connection tracking
- Stateless design allows for easy scaling
### Vertical Scaling
- Connection limits can be increased via aiohttp connector settings
- Memory usage grows with number of tracked connections
- Token buffer flushing interval can be adjusted
## Security
### Authentication
- API keys are stored in config.yaml (can use environment variables)
- Keys are passed to endpoints via Authorization headers
- No authentication required for router itself (can be added via middleware)
### Data Protection
- All communication uses TLS when configured
- No sensitive data logged (except in error messages)
- Database contains only token counts and timestamps
## Monitoring and Observability
### Metrics Endpoints
- `/api/usage`: Current connection counts
- `/api/token_counts`: Aggregated token usage
- `/api/stats`: Detailed statistics per model
- `/api/config`: Endpoint configuration and status
- `/api/usage-stream`: Real-time usage updates via SSE
### Logging
- Connection errors are logged with detailed context
- Endpoint selection decisions are logged
- Token counting operations are logged at debug level
## Future Enhancements
Potential areas for improvement:
- Kubernetes operator for automatic deployment
- Prometheus metrics endpoint
- Distributed connection tracking (Redis)
- Request retry logic with exponential backoff
- Circuit breaker pattern for failing endpoints
- Rate limiting per client

197
doc/configuration.md Normal file

@@ -0,0 +1,197 @@
# Configuration Guide
## Configuration File
The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys.
### Basic Configuration
```yaml
# config.yaml
endpoints:
- http://localhost:11434
- http://ollama-server:11434
# Maximum concurrent connections *per endpoint-model pair*
max_concurrent_connections: 2
```
### Complete Example
```yaml
# config.yaml
endpoints:
- http://192.168.0.50:11434
- http://192.168.0.51:11434
- http://192.168.0.52:11434
- https://api.openai.com/v1
# Maximum concurrent connections *per endpoint-model pair* (corresponds to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2
# API keys for remote endpoints.
# Set an environment variable such as OPENAI_KEY before startup.
# The URLs here must exactly match those in the endpoints block.
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
## Configuration Options
### `endpoints`
**Type**: `list[str]`
**Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`).
**Examples**:
```yaml
endpoints:
- http://localhost:11434
- http://ollama1:11434
- http://ollama2:11434
- https://api.openai.com/v1
- https://api.anthropic.com/v1
```
**Notes**:
- Ollama endpoints use the standard `/api/` prefix
- OpenAI-compatible endpoints use `/v1` prefix
- The router automatically detects endpoint type based on URL pattern
### `max_concurrent_connections`
**Type**: `int`
**Default**: `1`
**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting.
**Example**:
```yaml
max_concurrent_connections: 4
```
**Notes**:
- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage
### `api_keys`
**Type**: `dict[str, str]`
**Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints.
**Example**:
```yaml
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
**Environment Variables**:
- API keys can reference environment variables using `${VAR_NAME}` syntax
- The router will expand these references at startup
- Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable
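A minimal sketch of this expansion (the function name is illustrative; the router's actual expansion code may differ):

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str) -> str:
    """Replace ${VAR_NAME} references with environment variable values."""
    return _ENV_REF.sub(lambda m: os.environ.get(m.group(1), ""), value)
```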
## Environment Variables
### `NOMYO_ROUTER_CONFIG_PATH`
**Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory.
**Example**:
```bash
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
```
### `NOMYO_ROUTER_DB_PATH`
**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory.
**Example**:
```bash
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
```
### API-Specific Keys
You can set API keys directly as environment variables:
```bash
export OPENAI_KEY=your_openai_api_key
export ANTHROPIC_KEY=your_anthropic_api_key
```
## Configuration Best Practices
### Multiple Ollama Instances
For a cluster of Ollama instances:
```yaml
endpoints:
- http://ollama-worker1:11434
- http://ollama-worker2:11434
- http://ollama-worker3:11434
max_concurrent_connections: 2
```
**Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting.
### Mixed Endpoints
Combining Ollama and OpenAI endpoints:
```yaml
endpoints:
- http://localhost:11434
- https://api.openai.com/v1
api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
**Note**: The router will automatically route requests based on model availability across all endpoints.
### High Availability
For production deployments:
```yaml
endpoints:
- http://ollama-primary:11434
- http://ollama-secondary:11434
- http://ollama-tertiary:11434
max_concurrent_connections: 3
```
**Recommendation**: Use multiple endpoints for redundancy and load distribution.
## Configuration Validation
The router validates the configuration at startup:
1. **Endpoint URLs**: Must be valid URLs
2. **API Keys**: Must be strings (can reference environment variables)
3. **Connection Limits**: Must be positive integers
If the configuration is invalid, the router will exit with an error message.
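The three documented checks can be sketched like this (an illustrative validator, not the router's actual code; `validate_config` is a hypothetical name):

```python
from urllib.parse import urlparse

def validate_config(cfg: dict) -> list[str]:
    """Return a list of configuration problems; empty means valid."""
    errors = []
    # 1. Endpoint URLs must be valid
    for url in cfg.get("endpoints", []):
        parsed = urlparse(str(url))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"invalid endpoint URL: {url}")
    # 3. Connection limits must be positive integers
    mcc = cfg.get("max_concurrent_connections", 1)
    if not isinstance(mcc, int) or mcc < 1:
        errors.append("max_concurrent_connections must be a positive integer")
    # 2. API keys must be strings (env var references are expanded elsewhere)
    for ep, key in cfg.get("api_keys", {}).items():
        if not isinstance(key, str):
            errors.append(f"api key for {ep} must be a string")
    return errors
```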
## Dynamic Configuration
The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider:
1. Using a configuration management system
2. Implementing a rolling restart strategy
3. Using environment variables for sensitive data
## Example Configurations
See the [examples](examples/) directory for ready-to-use configuration examples.

444
doc/deployment.md Normal file

@@ -0,0 +1,444 @@
# Deployment Guide
## Deployment Options
NOMYO Router can be deployed in various environments depending on your requirements.
## 1. Bare Metal / VM Deployment
### Prerequisites
- Python 3.10+
- pip
- Virtual environment (recommended)
### Installation
```bash
# Clone the repository
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
# Create virtual environment
python3 -m venv .venv/router
source .venv/router/bin/activate
# Install dependencies
pip3 install -r requirements.txt
# Configure endpoints
nano config.yaml
```
### Running the Router
```bash
# Basic startup
uvicorn router:app --host 0.0.0.0 --port 12434
# With custom configuration path
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
uvicorn router:app --host 0.0.0.0 --port 12434
# With custom database path
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
uvicorn router:app --host 0.0.0.0 --port 12434
```
### Systemd Service
Create `/etc/systemd/system/nomyo-router.service`:
```ini
[Unit]
Description=NOMYO Router - Ollama Proxy
After=network.target
[Service]
User=nomyo
Group=nomyo
WorkingDirectory=/opt/nomyo-router
Environment="NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml"
Environment="NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db"
ExecStart=/opt/nomyo-router/.venv/router/bin/uvicorn router:app --host 0.0.0.0 --port 12434
Restart=always
RestartSec=10
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nomyo-router
[Install]
WantedBy=multi-user.target
```
Enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable nomyo-router
sudo systemctl start nomyo-router
sudo systemctl status nomyo-router
```
## 2. Docker Deployment
### Build the Image
```bash
docker build -t nomyo-router .
```
### Run the Container
```bash
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /absolute/path/to/config_folder:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
nomyo-router
```
### Advanced Docker Configuration
#### Custom Port
```bash
docker run -d \
--name nomyo-router \
-p 9000:12434 \
-v /path/to/config:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
nomyo-router \
-- --port 9000
```
#### Custom Host
```bash
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /path/to/config:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
-e UVICORN_HOST=0.0.0.0 \
nomyo-router
```
#### Persistent Database
```bash
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /path/to/config:/app/config/ \
-v /path/to/db:/app/token_counts.db \
-e CONFIG_PATH=/app/config/config.yaml \
-e NOMYO_ROUTER_DB_PATH=/app/token_counts.db \
nomyo-router
```
### Docker Compose Example
See [examples/docker-compose.yml](examples/docker-compose.yml) for a complete Docker Compose example.
## 3. Kubernetes Deployment
### Prerequisites
- Kubernetes cluster
- kubectl configured
- Helm (optional)
### Basic Deployment
Create `nomyo-router-deployment.yaml`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nomyo-router
  labels:
    app: nomyo-router
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nomyo-router
  template:
    metadata:
      labels:
        app: nomyo-router
    spec:
      containers:
        - name: nomyo-router
          image: nomyo-router:latest
          ports:
            - containerPort: 12434
          env:
            - name: CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: NOMYO_ROUTER_DB_PATH
              value: "/app/token_counts.db"
          volumeMounts:
            - name: config-volume
              mountPath: /app/config
            - name: db-volume
              mountPath: /app/token_counts.db
              subPath: token_counts.db
      volumes:
        - name: config-volume
          configMap:
            name: nomyo-router-config
        - name: db-volume
          persistentVolumeClaim:
            claimName: nomyo-router-db-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: nomyo-router
spec:
  selector:
    app: nomyo-router
  ports:
    - protocol: TCP
      port: 80
      targetPort: 12434
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nomyo-router-config
data:
  config.yaml: |
    endpoints:
      - http://ollama-service:11434
    max_concurrent_connections: 2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nomyo-router-db-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```
Apply the deployment:
```bash
kubectl apply -f nomyo-router-deployment.yaml
```
### Horizontal Pod Autoscaler
Create `nomyo-router-hpa.yaml`:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nomyo-router-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nomyo-router
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Apply the HPA:
```bash
kubectl apply -f nomyo-router-hpa.yaml
```
## 4. Production Deployment
### High Availability Setup
For production environments with multiple Ollama instances:
```yaml
# config.yaml
endpoints:
- http://ollama-worker1:11434
- http://ollama-worker2:11434
- http://ollama-worker3:11434
- https://api.openai.com/v1
max_concurrent_connections: 4
api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```
### Load Balancing
Deploy multiple router instances behind a load balancer:
```
              ┌─────────────────────────────────────────────┐
              │       Load Balancer (NGINX, Traefik)        │
              └─────────────────────────────────────────────┘
                    │              │              │
                    ▼              ▼              ▼
          ┌───────────────┐┌───────────────┐┌───────────────┐
          │ Router        ││ Router        ││ Router        │
          │ Instance 1    ││ Instance 2    ││ Instance 3    │
          └───────────────┘└───────────────┘└───────────────┘
                    │              │              │
                    ▼              ▼              ▼
              ┌─────────────────────────────────────────────┐
              │               Ollama Cluster                │
              │ ┌───────────┐ ┌───────────┐ ┌─────────────┐ │
              │ │  Ollama   │ │  Ollama   │ │ OpenAI API  │ │
              │ │ Worker 1  │ │ Worker 2  │ │ (Fallback)  │ │
              │ └───────────┘ └───────────┘ └─────────────┘ │
              └─────────────────────────────────────────────┘
```
### Monitoring and Logging
#### Prometheus Monitoring
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nomyo-router:12434']
```
#### Logging
Configure log aggregation:
```bash
# In Docker
docker run -d \
--name nomyo-router \
-p 12434:12434 \
-v /path/to/config:/app/config/ \
-e CONFIG_PATH=/app/config/config.yaml \
--log-driver=fluentd \
--log-opt fluentd-address=fluentd:24224 \
nomyo-router
```
## Deployment Checklist
### Pre-Deployment
- [ ] Configure all Ollama endpoints
- [ ] Set appropriate `max_concurrent_connections`
- [ ] Configure API keys for remote endpoints
- [ ] Test configuration locally
- [ ] Set up monitoring and alerting
- [ ] Configure logging
- [ ] Set up backup for token counts database
### Post-Deployment
- [ ] Verify health endpoint: `curl http://<router>/health`
- [ ] Check endpoint status: `curl http://<router>/api/config`
- [ ] Monitor connection counts: `curl http://<router>/api/usage`
- [ ] Set up regular backups
- [ ] Configure auto-restart on failure
- [ ] Monitor performance metrics
## Scaling Guidelines
### Vertical Scaling
- Increase `max_concurrent_connections` for more parallel requests
- Add more CPU/memory to the router instance
- Monitor memory usage (token buffer grows with usage)
### Horizontal Scaling
- Deploy multiple router instances
- Use a load balancer to distribute traffic
- Each instance maintains its own connection tracking
- Database can be shared or per-instance
### Database Considerations
- SQLite is sufficient for single-instance deployments
- For multi-instance deployments, consider PostgreSQL
- Regular backups are recommended
- Database size grows with token usage history
## Security Best Practices
### Network Security
- Use TLS for all external connections
- Restrict access to router port (12434)
- Use firewall rules to limit access
- Consider using VPN for internal communications
### Configuration Security
- Store API keys in environment variables
- Restrict access to config.yaml
- Use secrets management for production deployments
- Rotate API keys regularly
### Runtime Security
- Run as non-root user
- Set appropriate file permissions
- Monitor for suspicious activity
- Keep dependencies updated
## Troubleshooting Deployment Issues
### Common Issues
**Problem**: Router not starting
- **Check**: Logs for configuration errors
- **Solution**: Validate config.yaml syntax
**Problem**: Endpoints showing as unavailable
- **Check**: Network connectivity from router to endpoints
- **Solution**: Verify firewall rules and DNS resolution
**Problem**: High latency
- **Check**: Endpoint health and connection counts
- **Solution**: Add more endpoints or increase concurrency limits
**Problem**: Database errors
- **Check**: Database file permissions
- **Solution**: Ensure write permissions for the database path
**Problem**: Connection limits being hit
- **Check**: `/api/usage` endpoint
- **Solution**: Increase `max_concurrent_connections` or add endpoints
## Examples
See the [examples](examples/) directory for ready-to-use deployment examples.

doc/examples/docker-compose.yml Normal file

@@ -0,0 +1,98 @@
# Docker Compose example for NOMYO Router with multiple Ollama instances
version: '3.8'

services:
  # NOMYO Router
  nomyo-router:
    image: nomyo-router:latest
    build: .
    ports:
      - "12434:12434"
    environment:
      - CONFIG_PATH=/app/config/config.yaml
      - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
    volumes:
      - ./config:/app/config
      # Named volumes mount as directories, so keep the SQLite file inside one
      - router-db:/app/data
    depends_on:
      - ollama1
      - ollama2
      - ollama3
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 1
  ollama1:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama1-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 2
  ollama2:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
    volumes:
      - ollama2-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 3
  ollama3:
    image: ollama/ollama:latest
    ports:
      - "11436:11434"
    volumes:
      - ollama3-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Optional: Prometheus for monitoring
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped
    networks:
      - nomyo-net

  # Optional: Grafana for visualization
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    restart: unless-stopped
    networks:
      - nomyo-net

volumes:
  router-db:
  ollama1-data:
  ollama2-data:
  ollama3-data:
  grafana-storage:

networks:
  nomyo-net:
    driver: bridge

doc/examples/sample-config.yaml Normal file

@@ -0,0 +1,37 @@
# Sample NOMYO Router Configuration
# Basic single endpoint configuration
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
# Multi-endpoint configuration with local Ollama instances
# endpoints:
# - http://ollama-worker1:11434
# - http://ollama-worker2:11434
# - http://ollama-worker3:11434
# Mixed configuration with Ollama and OpenAI endpoints
# endpoints:
# - http://localhost:11434
# - https://api.openai.com/v1
# API keys for remote endpoints
# Use ${VAR_NAME} syntax to reference environment variables
api_keys:
  # Local Ollama instances typically don't require authentication
  "http://localhost:11434": "ollama"
  # Remote Ollama instances
  # "http://remote-ollama:11434": "ollama"
  # OpenAI API
  # "https://api.openai.com/v1": "${OPENAI_KEY}"
  # Anthropic API
  # "https://api.anthropic.com/v1": "${ANTHROPIC_KEY}"
  # Other OpenAI-compatible endpoints
  # "https://api.mistral.ai/v1": "${MISTRAL_KEY}"

515
doc/monitoring.md Normal file

@@ -0,0 +1,515 @@
# Monitoring and Troubleshooting Guide
## Monitoring Overview
NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
## Monitoring Endpoints
### Health Check
```bash
curl http://localhost:12434/health
```
Response:
```json
{
"status": "ok" | "error",
"endpoints": {
"http://endpoint1:11434": {
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
}
}
```
**HTTP Status Codes**:
- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy
### Current Usage
```bash
curl http://localhost:12434/api/usage
```
Response:
```json
{
"usage_counts": {
"http://endpoint1:11434": {
"llama3": 2,
"mistral": 1
},
"http://endpoint2:11434": {
"llama3": 0,
"mistral": 3
}
},
"token_usage_counts": {
"http://endpoint1:11434": {
"llama3": 1542,
"mistral": 876
}
}
}
```
### Token Statistics
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
"total_tokens": 2418,
"breakdown": [
{
"endpoint": "http://endpoint1:11434",
"model": "llama3",
"input_tokens": 120,
"output_tokens": 1422,
"total_tokens": 1542
},
{
"endpoint": "http://endpoint1:11434",
"model": "mistral",
"input_tokens": 80,
"output_tokens": 796,
"total_tokens": 876
}
]
}
```
### Model Statistics
```bash
curl http://localhost:12434/api/stats -X POST -H "Content-Type: application/json" -d '{"model": "llama3"}'
```
Response:
```json
{
"model": "llama3",
"input_tokens": 120,
"output_tokens": 1422,
"total_tokens": 1542,
"time_series": [
{
"endpoint": "http://endpoint1:11434",
"timestamp": 1712345600,
"input_tokens": 20,
"output_tokens": 150,
"total_tokens": 170
}
],
"endpoint_distribution": {
"http://endpoint1:11434": 1542
}
}
```
### Configuration Status
```bash
curl http://localhost:12434/api/config
```
Response:
```json
{
"endpoints": [
{
"url": "http://endpoint1:11434",
"status": "ok" | "error",
"version": "string" | "detail": "error message"
}
]
}
```
### Real-time Usage Stream
```bash
curl http://localhost:12434/api/usage-stream
```
This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}
data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
## Monitoring Tools
### Prometheus Integration
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```
### Grafana Dashboard
Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates
### Logging
The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions
## Troubleshooting Guide
### Common Issues and Solutions
#### 1. Endpoint Unavailable
**Symptoms**:
- Health check shows endpoint as "error"
- Requests fail with connection errors
**Diagnosis**:
```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```
**Solutions**:
- Verify Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test direct connection: `curl http://endpoint:11434/api/version`
#### 2. Model Not Found
**Symptoms**:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found
**Diagnosis**:
```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```
**Solutions**:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if model is available on any endpoint
- For OpenAI endpoints, ensure model exists in their catalog
#### 3. High Latency
**Symptoms**:
- Slow response times
- Requests timing out
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```
**Solutions**:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints
#### 4. Connection Limits Reached
**Symptoms**:
- Requests queuing
- High connection counts
- Slow response times
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
```
**Solutions**:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use MOE system for critical queries
#### 5. Token Tracking Not Working
**Symptoms**:
- Token counts not updating
- Database errors
**Diagnosis**:
```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```
**Solutions**:
- Verify database file permissions
- Check if database path is writable
- Restart router to rebuild database
- Check disk space
- Verify environment variable `NOMYO_ROUTER_DB_PATH`
#### 6. Streaming Issues
**Symptoms**:
- Incomplete responses
- Connection resets during streaming
- Timeout errors
**Diagnosis**:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts
**Solutions**:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads
### Error Messages
#### "Failed to connect to endpoint"
**Cause**: Network connectivity issue
**Action**: Verify endpoint is reachable, check firewall, test DNS
#### "None of the configured endpoints advertise the model"
**Cause**: Model not pulled on any endpoint
**Action**: Pull the model, verify model name
#### "Timed out waiting for endpoint"
**Cause**: Endpoint slow to respond
**Action**: Check endpoint health, increase timeouts
#### "Invalid JSON format in request body"
**Cause**: Malformed request
**Action**: Validate request payload, check API documentation
#### "Missing required field 'model'"
**Cause**: Request missing model parameter
**Action**: Add model parameter to request
### Performance Tuning
#### Optimizing Connection Handling
1. **Adjust concurrency limits**:
```yaml
max_concurrent_connections: 4
```
2. **Monitor connection usage**:
```bash
curl http://localhost:12434/api/usage
```
3. **Scale horizontally**:
- Add more Ollama endpoints
- Deploy multiple router instances
#### Reducing Latency
1. **Keep models loaded**:
- Use frequently accessed models
- Monitor `/api/ps` for loaded models
2. **Optimize endpoint selection**:
- Distribute models across endpoints
- Balance load evenly
3. **Use caching**:
- Models cache (300s TTL)
- Loaded models cache (30s TTL)
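Those two TTLs can be pictured as a timestamped dictionary: serve the cached value while it is fresh, refetch once it expires. A minimal sketch (not the router's actual cache implementation):

```python
import time

class TTLCache:
    """Cache entries for a fixed number of seconds, then refetch."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}  # key -> (expiry_time, value)

    def get(self, key, fetch):
        """Return the cached value, calling fetch() when missing or expired."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[0] > now:
            return entry[1]  # still fresh
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value

models_cache = TTLCache(ttl_seconds=300)  # models list, per the TTL above
loaded_cache = TTLCache(ttl_seconds=30)   # loaded models, shorter TTL
```

The shorter TTL on loaded models trades a little extra polling for a more accurate picture of which endpoints already have a model in memory.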
#### Memory Management
1. **Monitor memory usage**:
- Token buffer grows with usage
- Time-series data accumulates
2. **Adjust flush interval**:
- Default: 10 seconds
- Can be increased for less frequent I/O
3. **Database maintenance**:
- Regular backups
- Archive old data periodically
### Database Management
#### Viewing Token Data
```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```
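Building on the queries above, per-model totals can be aggregated directly in SQL. The schema used below (endpoint, model, input/output token columns) is an assumption for illustration; inspect your `token_counts.db` with `.schema` for the real one:

```python
import sqlite3

# Build a small in-memory stand-in for token_counts.db (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts ("
    " endpoint TEXT, model TEXT,"
    " input_tokens INTEGER, output_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO token_counts VALUES (?, ?, ?, ?)",
    [
        ("http://endpoint1:11434", "llama3", 120, 1422),
        ("http://endpoint1:11434", "mistral", 80, 796),
    ],
)

# Aggregate total tokens per model, mirroring the breakdown the API reports.
rows = conn.execute(
    "SELECT model, SUM(input_tokens + output_tokens) AS total"
    " FROM token_counts GROUP BY model ORDER BY total DESC"
).fetchall()
print(rows)  # [('llama3', 1542), ('mistral', 876)]
```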
#### Aggregating Old Data
```bash
curl http://localhost:12434/api/aggregate_time_series_days \
-X POST \
-d '{"days": 30, "trim_old": true}'
```
#### Backing Up Database
```bash
cp token_counts.db token_counts.db.backup
```
#### Restoring Database
```bash
cp token_counts.db.backup token_counts.db
```
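Copying the file while the router is writing can capture a mid-transaction state. SQLite's online backup API, exposed through Python's `sqlite3` module, produces a consistent copy even from a live database; a sketch:

```python
import sqlite3

def backup_database(src_path: str, dest_path: str) -> None:
    """Create a consistent copy of a (possibly live) SQLite database."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # online backup: safe against concurrent writers
    src.close()
    dest.close()

# backup_database("token_counts.db", "token_counts.db.backup")
```

The plain `cp` above is fine when the router is stopped; prefer this approach for backups taken while it is running.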
### Advanced Troubleshooting
#### Debugging Endpoint Selection
Enable debug logging to see endpoint selection decisions:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```
#### Testing Individual Endpoints
```bash
# Test endpoint directly
curl http://endpoint:11434/api/version
# Test model availability
curl http://endpoint:11434/api/tags
# Test model loading
curl http://endpoint:11434/api/ps
```
#### Network Diagnostics
```bash
# Test connectivity
nc -zv endpoint 11434
# Test DNS resolution
dig endpoint
# Test latency
ping endpoint
```
### Common Pitfalls
1. **Using localhost in Docker**:
- Inside Docker, `localhost` refers to the container itself
- Use `host.docker.internal` or Docker service names
2. **Incorrect model names**:
- Ollama: `llama3:latest`
- OpenAI: `gpt-4` (no version suffix)
3. **Missing API keys**:
- Remote endpoints require authentication
- Set keys in config.yaml or environment variables
4. **Firewall blocking**:
- Ensure port 11434 is open for Ollama
- Ensure port 12434 is open for router
5. **Insufficient resources**:
- Monitor CPU/memory on Ollama endpoints
- Adjust `max_concurrent_connections` accordingly
## Best Practices
### Monitoring Setup
1. **Set up health checks**:
- Monitor `/health` endpoint
- Alert on status "error"
2. **Track usage metrics**:
- Monitor connection counts
- Track token usage
- Watch for connection limits
3. **Log important events**:
- Configuration changes
- Endpoint failures
- Recovery events
4. **Regular backups**:
- Backup token_counts.db
- Schedule regular backups
- Test restore procedure
### Performance Monitoring
1. **Baseline metrics**:
- Establish normal usage patterns
- Track trends over time
2. **Alert thresholds**:
- Set alerts for high connection counts
- Monitor error rates
- Watch for latency spikes
3. **Capacity planning**:
- Track growth in token usage
- Plan for scaling needs
- Monitor resource utilization
### Incident Response
1. **Quick diagnosis**:
- Check health endpoint first
- Review recent logs
- Verify endpoint status
2. **Isolation**:
- Identify affected endpoints
- Isolate problematic components
- Fall back to healthy endpoints
3. **Recovery**:
- Restart router if needed
- Rebalance load
- Restore from backup if necessary
## Examples
See the [examples](examples/) directory for monitoring configuration examples.

doc/usage.md Normal file
@ -0,0 +1,348 @@
# Usage Guide
## Quick Start
### 1. Install the Router
```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```
### 2. Configure Endpoints
Edit `config.yaml`:
```yaml
endpoints:
- http://localhost:11434
max_concurrent_connections: 2
```
### 3. Run the Router
```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```
### 4. Use the Router
Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.
## API Endpoints
### Ollama-Compatible Endpoints
The router provides all standard Ollama API endpoints:
| Endpoint | Method | Description |
| --------------- | ------ | --------------------- |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completions |
| `/api/embed` | POST | Embeddings |
| `/api/tags` | GET | List available models |
| `/api/ps` | GET | List loaded models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Pull a model |
| `/api/push` | POST | Push a model |
| `/api/create` | POST | Create a model |
| `/api/copy` | POST | Copy a model |
| `/api/delete` | DELETE | Delete a model |
### OpenAI-Compatible Endpoints
For OpenAI API compatibility:
| Endpoint | Method | Description |
| ---------------------- | ------ | ---------------- |
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/embeddings` | POST | Embeddings |
| `/v1/models` | GET | List models |
### Monitoring Endpoints
| Endpoint | Method | Description |
| ---------------------------------- | ------ | ---------------------------------------- |
| `/api/usage` | GET | Current connection counts |
| `/api/token_counts` | GET | Token usage statistics |
| `/api/stats` | POST | Detailed model statistics |
| `/api/aggregate_time_series_days` | POST | Aggregate time-series data into daily buckets |
| `/api/version` | GET | Ollama version info |
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
## Making Requests
### Basic Chat Request
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:latest",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"stream": false
}'
```
### Streaming Response
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:latest",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": true
}'
```
### OpenAI API Format
```bash
curl http://localhost:12434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-nano",
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
## Advanced Features
### Multiple Opinions Ensemble (MOE)
Prefix your model name with `moe-` to enable the MOE system:
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "moe-llama3:latest",
"messages": [
{"role": "user", "content": "Explain quantum computing"}
]
}'
```
The MOE system:
1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response
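The four steps above can be sketched client-side. This is an illustration of the ensemble idea, not the router's internal implementation; `generate`, `critique`, `select`, and `refine` are hypothetical callables supplied by the caller:

```python
def moe_answer(prompt, generate, critique, select, refine, n=3):
    """Multiple Opinions Ensemble, following the four steps outlined above."""
    drafts = [generate(prompt, i) for i in range(n)]   # 1. n draft responses
    reviews = [critique(prompt, d) for d in drafts]    # 2. critique each draft
    best = select(drafts, reviews)                     # 3. pick the best draft
    return refine(prompt, best, reviews)               # 4. final refined answer

# With stub callables, the pipeline wires together like this:
answer = moe_answer(
    "Explain quantum computing",
    generate=lambda p, i: f"draft {i} for: {p}",
    critique=lambda p, d: len(d),          # toy critique: prefer longer drafts
    select=lambda ds, rs: ds[rs.index(max(rs))],
    refine=lambda p, b, rs: f"refined: {b}",
)
print(answer)
```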
### Token Tracking
The router automatically tracks token usage:
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
"total_tokens": 1542,
"breakdown": [
{
"endpoint": "http://localhost:11434",
"model": "llama3",
"input_tokens": 120,
"output_tokens": 1422,
"total_tokens": 1542
}
]
}
```
### Real-time Monitoring
Use Server-Sent Events to monitor usage in real-time:
```bash
curl http://localhost:12434/api/usage-stream
```
## Integration Examples
### Python Client
```python
import requests
url = "http://localhost:12434/api/chat"
data = {
"model": "llama3",
"messages": [{"role": "user", "content": "What is AI?"}],
"stream": False
}
response = requests.post(url, json=data)
print(response.json())
```
### JavaScript Client
```javascript
const response = await fetch('http://localhost:12434/api/chat', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'llama3',
messages: [{ role: 'user', content: 'Hello!' }],
stream: false
})
});
const data = await response.json();
console.log(data);
```
### Streaming with JavaScript
```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');
eventSource.onmessage = (event) => {
const usage = JSON.parse(event.data);
console.log('Current usage:', usage);
};
```
### Python Ollama Client
```python
from ollama import Client
# Configure the client to use the router
client = Client(host='http://localhost:12434')
# Chat with a model
response = client.chat(
model='llama3:latest',
messages=[
{'role': 'user', 'content': 'Explain quantum computing'}
]
)
print(response['message']['content'])
# Generate text
response = client.generate(
model='llama3:latest',
prompt='Write a short poem about AI'
)
print(response['response'])
# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```
### Python OpenAI Client
```python
from openai import OpenAI
# Configure the client to use the router
client = OpenAI(
base_url='http://localhost:12434/v1',
api_key='not-needed' # API key is not required for local usage
)
# Chat completions
response = client.chat.completions.create(
model='gpt-4o-nano',
messages=[
{'role': 'user', 'content': 'What is the meaning of life?'}
]
)
print(response.choices[0].message.content)
# Text completions
response = client.completions.create(
model='llama3:latest',
prompt='Once upon a time'
)
print(response.choices[0].text)
# Embeddings
response = client.embeddings.create(
model='llama3:latest',
input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")
# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```
## Best Practices
### 1. Model Selection
- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without version
### 2. Connection Management
- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries
### 3. Error Handling
- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
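The retry advice above can be condensed into a small helper with exponentially growing delays (a sketch; tune the attempt count and base delay to your deployment, and note the `requests` call in the comment is a hypothetical usage):

```python
import time

def with_retries(call, attempts=3, base_delay=0.5):
    """Call `call()`, retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Example (hypothetical): with_retries(lambda: requests.post(url, json=data).json())
```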
### 4. Performance
- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs
## Troubleshooting
### Common Issues
**Problem**: Model not found
- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`
**Problem**: Connection refused
- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`
**Problem**: High latency
- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`
**Problem**: Token tracking not working
- **Solution**: Ensure database path is writable
- **Check**: `ls -la token_counts.db`
## Examples
See the [examples](examples/) directory for complete integration examples.

router.py

@ -9,6 +9,9 @@ license: AGPL
import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance
from datetime import datetime, timezone
from pathlib import Path
# Directory containing static files (relative to this script)
STATIC_DIR = Path(__file__).parent / "static"
from typing import Dict, Set, List, Optional
from urllib.parse import urlparse
from fastapi import FastAPI, Request, HTTPException
@ -55,6 +58,8 @@ flush_task: asyncio.Task | None = None
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
# Time series buffer with timestamp
time_series_buffer: list[dict[str, int | str]] = []
# Lock to protect buffer access from race conditions
buffer_lock = asyncio.Lock()
# Configuration for periodic flushing
FLUSH_INTERVAL = 10 # seconds
@ -218,45 +223,112 @@ def is_ext_openai_endpoint(endpoint: str) -> bool:
return True # It's an external OpenAI endpoint
async def token_worker() -> None:
try:
while True:
endpoint, model, prompt, comp = await token_queue.get()
# Accumulate counts in memory buffer (protected by lock)
async with buffer_lock:
token_buffer[endpoint][model] = (
token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
token_buffer[endpoint].get(model, (0, 0))[1] + comp
)
# Add to time series buffer with timestamp (UTC)
now = datetime.now(tz=timezone.utc)
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
time_series_buffer.append({
'endpoint': endpoint,
'model': model,
'input_tokens': prompt,
'output_tokens': comp,
'total_tokens': prompt + comp,
'timestamp': timestamp
})
# Update in-memory counts for immediate reporting
async with token_usage_lock:
token_usage_counts[endpoint][model] += (prompt + comp)
await publish_snapshot()
except asyncio.CancelledError:
# Gracefully handle task cancellation during shutdown
print("[token_worker] Task cancelled, processing remaining queue items...")
# Process any remaining items in the queue before exiting
while not token_queue.empty():
try:
endpoint, model, prompt, comp = token_queue.get_nowait()
async with buffer_lock:
token_buffer[endpoint][model] = (
token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
token_buffer[endpoint].get(model, (0, 0))[1] + comp
)
now = datetime.now(tz=timezone.utc)
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
time_series_buffer.append({
'endpoint': endpoint,
'model': model,
'input_tokens': prompt,
'output_tokens': comp,
'total_tokens': prompt + comp,
'timestamp': timestamp
})
async with token_usage_lock:
token_usage_counts[endpoint][model] += (prompt + comp)
await publish_snapshot()
except asyncio.QueueEmpty:
break
print("[token_worker] Task cancelled, remaining items processed.")
raise
async def flush_buffer() -> None:
"""Periodically flush accumulated token counts to the database."""
try:
while True:
await asyncio.sleep(FLUSH_INTERVAL)
# Flush token counts and time series (protected by lock)
async with buffer_lock:
if token_buffer:
# Copy buffer before releasing lock for DB operation
buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
token_buffer.clear()
else:
buffer_copy = None
if time_series_buffer:
ts_copy = list(time_series_buffer)
time_series_buffer.clear()
else:
ts_copy = None
# Perform DB operations outside the lock to avoid blocking
if buffer_copy:
await db.update_batched_counts(buffer_copy)
if ts_copy:
await db.add_batched_time_series(ts_copy)
except asyncio.CancelledError:
# Gracefully handle task cancellation during shutdown
print("[flush_buffer] Task cancelled, flushing remaining buffers...")
# Flush any remaining data before exiting
try:
async with buffer_lock:
if token_buffer:
buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
token_buffer.clear()
else:
buffer_copy = None
if time_series_buffer:
ts_copy = list(time_series_buffer)
time_series_buffer.clear()
else:
ts_copy = None
if buffer_copy:
await db.update_batched_counts(buffer_copy)
if ts_copy:
await db.add_batched_time_series(ts_copy)
print("[flush_buffer] Task cancelled, remaining buffers flushed.")
except Exception as e:
print(f"[flush_buffer] Error during shutdown flush: {e}")
raise
async def flush_remaining_buffers() -> None:
"""
@ -265,14 +337,24 @@ async def flush_remaining_buffers() -> None:
"""
try:
flushed_entries = 0
async with buffer_lock:
if token_buffer:
buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
flushed_entries += sum(len(v) for v in token_buffer.values())
token_buffer.clear()
else:
buffer_copy = None
if time_series_buffer:
ts_copy = list(time_series_buffer)
flushed_entries += len(time_series_buffer)
time_series_buffer.clear()
else:
ts_copy = None
# Perform DB operations outside the lock
if buffer_copy:
await db.update_batched_counts(buffer_copy)
if ts_copy:
await db.add_batched_time_series(ts_copy)
if flushed_entries:
print(f"[shutdown] Flushed {flushed_entries} in-memory entries to DB on shutdown.")
else:
@ -884,7 +966,9 @@ async def choose_endpoint(model: str) -> str:
]
if endpoints_with_free_slot:
#return random.choice(endpoints_with_free_slot)
endpoints_with_free_slot.sort(key=lambda ep: sum(usage_counts.get(ep, {}).values()))
return endpoints_with_free_slot[0]
# 5⃣ All candidate endpoints are saturated: pick the one with the lowest usage count (will queue)
ep = min(candidate_endpoints, key=current_usage)
@ -2053,7 +2137,13 @@ async def index(request: Request):
Render the dynamic NOMYO Router dashboard listing the configured endpoints
and the models details, availability & task status.
"""
index_path = STATIC_DIR / "index.html"
try:
return HTMLResponse(content=index_path.read_text(encoding="utf-8"), status_code=200)
except FileNotFoundError:
raise HTTPException(status_code=404, detail="Page not found")
except Exception:
raise HTTPException(status_code=500, detail="Internal server error")
# -------------------------------------------------------------
# 26. Health endpoint