diff --git a/doc/README.md b/doc/README.md new file mode 100644 index 0000000..70926b7 --- /dev/null +++ b/doc/README.md @@ -0,0 +1,137 @@ +# NOMYO Router Documentation + +Welcome to the NOMYO Router documentation! This folder contains comprehensive guides for using, configuring, and deploying the NOMYO Router. + +## Documentation Structure + +``` +doc/ +├── architecture.md # Technical architecture overview +├── configuration.md # Detailed configuration guide +├── usage.md # API usage examples +├── deployment.md # Deployment scenarios +├── monitoring.md # Monitoring and troubleshooting +└── examples/ # Example configurations and scripts + ├── docker-compose.yml + ├── sample-config.yaml + └── k8s-deployment.yaml +``` + +## Getting Started + +### Quick Start Guide + +1. **Install the router**: + + ```bash + git clone https://github.com/nomyo-ai/nomyo-router.git + cd nomyo-router + python3 -m venv .venv/router + source .venv/router/bin/activate + pip3 install -r requirements.txt + ``` +2. **Configure endpoints** in `config.yaml`: + + ```yaml + endpoints: + - http://localhost:11434 + max_concurrent_connections: 2 + ``` +3. **Run the router**: + + ```bash + uvicorn router:app --host 0.0.0.0 --port 12434 + ``` +4. **Use the router**: Point your frontend to `http://localhost:12434` instead of your Ollama instance. + +### Key Features + +- **Intelligent Routing**: Model deployment-aware routing with load balancing +- **Multi-Endpoint Support**: Combine Ollama and OpenAI-compatible endpoints +- **Token Tracking**: Comprehensive token usage monitoring +- **Real-time Monitoring**: Server-Sent Events for live usage updates +- **OpenAI Compatibility**: Full OpenAI API compatibility layer +- **MOE System**: Multiple Opinions Ensemble for improved responses with smaller models + +## Documentation Guides + +### [Architecture](architecture.md) + +Learn about the router's internal architecture, routing algorithm, caching mechanisms, and advanced features like the MOE system. + +### [Configuration](configuration.md) + +Detailed guide on configuring the router with multiple endpoints, API keys, and environment variables. + +### [Usage](usage.md) + +Comprehensive API reference with examples for making requests, streaming responses, and using advanced features. + +### [Deployment](deployment.md) + +Step-by-step deployment guides for bare metal, Docker, Kubernetes, and production environments. + +### [Monitoring](monitoring.md) + +Monitoring endpoints, troubleshooting guides, performance tuning, and best practices for maintaining your router. + +## Examples + +The [examples](examples/) directory contains ready-to-use configuration files: + +- **docker-compose.yml**: Complete Docker Compose setup with multiple Ollama instances +- **sample-config.yaml**: Example configuration with comments +- **k8s-deployment.yaml**: Kubernetes deployment manifests + +## Need Help? + +### Common Issues + +Check the [Monitoring Guide](monitoring.md) for troubleshooting common problems: + +- Endpoint unavailable +- Model not found +- High latency +- Connection limits reached +- Token tracking issues + +### Support + +For additional help: + +1. Check the [GitHub Issues](https://github.com/nomyo-ai/nomyo-router/issues) +2. Review the [Monitoring Guide](monitoring.md) for diagnostics +3. 
Examine the router logs for detailed error messages
+
+## Best Practices
+
+### Configuration
+
+- Use environment variables for API keys
+- Set appropriate `max_concurrent_connections` based on your hardware
+- Monitor endpoint health regularly
+- Keep models loaded on multiple endpoints for redundancy
+
+### Deployment
+
+- Use Docker for containerized deployments
+- Consider Kubernetes for production at scale
+- Set up monitoring and alerting
+- Implement regular backups of the token-counts database
+
+### Performance
+
+- Balance load across multiple endpoints
+- Keep frequently used models loaded
+- Monitor connection counts and token usage
+- Scale horizontally when needed
+
+## Next Steps
+
+1. **Read the [Architecture Guide](architecture.md)** to understand how the router works
+2. **Configure your endpoints** in `config.yaml`
+3. **Deploy the router** using your preferred method
+4. **Monitor your setup** using the monitoring endpoints
+5. **Scale as needed** by adding more endpoints
+
+Happy routing! 🚀
diff --git a/doc/architecture.md b/doc/architecture.md
new file mode 100644
index 0000000..aa96855
--- /dev/null
+++ b/doc/architecture.md
@@ -0,0 +1,259 @@
+# NOMYO Router Architecture
+
+## Overview
+
+NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and Ollama backend(s), providing intelligent request routing based on model availability and load balancing.
+
+## Core Components
+
+### 1. Request Routing Engine
+
+The router's core intelligence is in the `choose_endpoint()` function, which implements a sophisticated routing algorithm:
+
+```python
+async def choose_endpoint(model: str) -> str:
+    """
+    Endpoint selection algorithm:
+    1. Query all endpoints for advertised models
+    2. Filter endpoints that advertise the requested model
+    3. Among candidates, find those with the model loaded AND free slots
+    4. If none loaded with free slots, pick any with free slots
+    5. If all saturated, pick endpoint with lowest current usage
+    6. If no endpoint advertises the model, raise error
+    """
+```
+
+### 2. Connection Tracking
+
+The router maintains real-time connection counts per endpoint-model pair:
+
+```python
+usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
+```
+
+This allows for:
+
+- **Load-aware routing**: Requests are routed to endpoints with available capacity
+- **Model-aware routing**: Requests are routed to endpoints where the model is already loaded
+- **Efficient resource utilization**: Minimizes model loading/unloading operations
+
+### 3. Caching Layer
+
+Three types of caches improve performance:
+
+- **Models cache** (`_models_cache`): Caches available models per endpoint (300s TTL)
+- **Loaded models cache** (`_loaded_models_cache`): Caches currently loaded models (30s TTL)
+- **Error cache** (`_error_cache`): Caches transient errors (10s TTL)
+
+### 4. Token Tracking System
+
+Comprehensive token usage tracking:
+
+```python
+token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
+time_series_buffer: list[dict[str, int | str]] = []
+```
+
+Features:
+
+- Real-time token counting for input/output tokens
+- Periodic flushing to SQLite database (every 10 seconds)
+- Time-series data for historical analysis
+- Per-endpoint, per-model breakdown
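+
+This buffering is a write-behind pattern: counts accumulate in memory and reach SQLite in batches. A minimal sketch of the idea, with names taken from `router.py` but the locking and shutdown handling stripped out (the real implementation additionally guards the buffer with an `asyncio.Lock`):
+
+```python
+import asyncio
+from collections import defaultdict
+
+FLUSH_INTERVAL = 10  # seconds, matching the router's default
+
+# endpoint -> model -> (input_tokens, output_tokens)
+token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(
+    lambda: defaultdict(lambda: (0, 0))
+)
+
+def record_usage(endpoint: str, model: str, prompt: int, completion: int) -> None:
+    """Accumulate token counts in memory instead of hitting the DB per request."""
+    cur_in, cur_out = token_buffer[endpoint][model]
+    token_buffer[endpoint][model] = (cur_in + prompt, cur_out + completion)
+
+async def flush_loop(db) -> None:
+    """Every FLUSH_INTERVAL seconds, push the buffered counts to SQLite in one batch."""
+    while True:
+        await asyncio.sleep(FLUSH_INTERVAL)
+        if token_buffer:
+            await db.update_batched_counts(token_buffer)
+            token_buffer.clear()
+```
+
+### 5. 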
API Compatibility Layer + +The router supports multiple API formats: + +- **Ollama API**: Native `/api/generate`, `/api/chat`, `/api/embed` endpoints +- **OpenAI API**: Compatible `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings` endpoints +- **Transparent conversion**: Responses are converted between formats as needed + +## Data Flow + +### Request Processing + +1. **Ingress**: Frontend sends request to router +2. **Endpoint Selection**: Router determines optimal endpoint +3. **Request Forwarding**: Request sent to selected Ollama endpoint +4. **Response Streaming**: Response streamed back to frontend +5. **Usage Tracking**: Connection and token counts updated +6. **Egress**: Complete response returned to frontend + +### Connection Management + +```mermaid +sequenceDiagram + participant Frontend + participant Router + participant Endpoint1 + participant Endpoint2 + + Frontend->>Router: Request for model X + Router->>Endpoint1: Check if model X is loaded + Router->>Endpoint2: Check if model X is loaded + alt Endpoint1 has model X loaded + Router->>Endpoint1: Forward request + Endpoint1->>Router: Stream response + Router->>Frontend: Stream response + else Endpoint2 has model X loaded + Router->>Endpoint2: Forward request + Endpoint2->>Router: Stream response + Router->>Frontend: Stream response + else No endpoint has model X loaded + Router->>Endpoint1: Forward request (will trigger load) + Endpoint1->>Router: Stream response + Router->>Frontend: Stream response + end +``` + +## Advanced Features + +### Multiple Opinions Ensemble (MOE) + +When the user prefixes a model name with `moe-`, the router activates the MOE system: + +1. Generates 3 responses from different endpoints +2. Generates 3 critiques of those responses +3. Selects the best response based on critiques +4. Generates final refined response + +### OpenAI Endpoint Support + +The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. 
It automatically: + +- Detects OpenAI endpoints (those containing `/v1`) +- Converts between Ollama and OpenAI response formats +- Handles authentication with API keys +- Maintains consistent behavior across endpoint types + +## Performance Considerations + +### Concurrency Model + +- **Max concurrent connections**: Configurable per endpoint-model pair +- **Connection pooling**: Reuses aiohttp connections +- **Async I/O**: All operations are non-blocking +- **Backpressure handling**: Queues requests when endpoints are saturated + +### Caching Strategy + +- **Short TTL for loaded models** (30s): Ensures quick detection of model loading/unloading +- **Longer TTL for available models** (300s): Reduces unnecessary API calls +- **Error caching** (10s): Prevents thundering herd during outages + +### Memory Management + +- **Write-behind pattern**: Token counts buffered in memory, flushed periodically +- **Queue-based SSE**: Server-Sent Events use bounded queues to prevent memory bloat +- **Automatic cleanup**: Zero connection counts are removed from tracking + +## Error Handling + +### Transient Errors + +- Temporary connection failures are cached for 10 seconds +- During cache period, endpoint is treated as unavailable +- After cache expires, endpoint is re-tested + +### Permanent Errors + +- Invalid model names result in clear error messages +- Missing required fields return 400 Bad Request +- Unreachable endpoints are reported with detailed connection issues + +### Health Monitoring + +The `/health` endpoint provides comprehensive health status: + +```json +{ + "status": "ok" | "error", + "endpoints": { + "http://endpoint1:11434": { + "status": "ok" | "error", + "version": "string" | "detail": "error message" + } + } +} +``` + +## Database Schema + +The router uses SQLite for persistent storage: + +```sql +CREATE TABLE token_counts ( + endpoint TEXT NOT NULL, + model TEXT NOT NULL, + input_tokens INTEGER NOT NULL, + output_tokens INTEGER NOT NULL, + total_tokens INTEGER NOT NULL, + PRIMARY KEY (endpoint, model) +); + +CREATE TABLE time_series ( + endpoint TEXT NOT NULL, + model TEXT NOT NULL, + input_tokens INTEGER NOT NULL, + output_tokens INTEGER NOT NULL, + total_tokens INTEGER NOT NULL, + timestamp INTEGER NOT NULL, + PRIMARY KEY (endpoint, model, timestamp) +); +``` + +## Scaling Considerations + +### Horizontal Scaling + +- Multiple router instances can run behind a load balancer +- Each instance maintains its own connection tracking +- Stateless design allows for easy scaling + +### Vertical Scaling + +- Connection limits can be increased via aiohttp connector settings +- Memory usage grows with number of tracked connections +- Token buffer flushing interval can be adjusted + +## Security + +### Authentication + +- API keys are stored in config.yaml (can use environment variables) +- Keys are passed to endpoints via Authorization headers +- No authentication required for router itself (can be added via middleware) + +### Data Protection + +- All communication uses TLS when configured +- No sensitive data logged (except in error messages) +- Database contains only token counts and timestamps + +## Monitoring and Observability + +### Metrics Endpoints + +- `/api/usage`: Current connection counts +- `/api/token_counts`: Aggregated token usage +- `/api/stats`: Detailed statistics per model +- `/api/config`: Endpoint configuration and status +- `/api/usage-stream`: Real-time usage updates via SSE + +### Logging + +- Connection errors are logged with detailed context +- Endpoint selection 
decisions are logged +- Token counting operations are logged at debug level + +## Future Enhancements + +Potential areas for improvement: + +- Kubernetes operator for automatic deployment +- Prometheus metrics endpoint +- Distributed connection tracking (Redis) +- Request retry logic with exponential backoff +- Circuit breaker pattern for failing endpoints +- Rate limiting per client diff --git a/doc/configuration.md b/doc/configuration.md new file mode 100644 index 0000000..7ca9b1f --- /dev/null +++ b/doc/configuration.md @@ -0,0 +1,197 @@ +# Configuration Guide + +## Configuration File + +The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys. + +### Basic Configuration + +```yaml +# config.yaml +endpoints: + - http://localhost:11434 + - http://ollama-server:11434 + +# Maximum concurrent connections *per endpoint‑model pair* +max_concurrent_connections: 2 +``` + +### Complete Example + +```yaml +# config.yaml +endpoints: + - http://192.168.0.50:11434 + - http://192.168.0.51:11434 + - http://192.168.0.52:11434 + - https://api.openai.com/v1 + +# Maximum concurrent connections *per endpoint‑model pair* (equals to OLLAMA_NUM_PARALLEL) +max_concurrent_connections: 2 + +# API keys for remote endpoints +# Set an environment variable like OPENAI_KEY +# Confirm endpoints are exactly as in endpoints block +api_keys: + "http://192.168.0.50:11434": "ollama" + "http://192.168.0.51:11434": "ollama" + "http://192.168.0.52:11434": "ollama" + "https://api.openai.com/v1": "${OPENAI_KEY}" +``` + +## Configuration Options + +### `endpoints` + +**Type**: `list[str]` + +**Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`). + +**Examples**: +```yaml +endpoints: + - http://localhost:11434 + - http://ollama1:11434 + - http://ollama2:11434 + - https://api.openai.com/v1 + - https://api.anthropic.com/v1 +``` + +**Notes**: +- Ollama endpoints use the standard `/api/` prefix +- OpenAI-compatible endpoints use `/v1` prefix +- The router automatically detects endpoint type based on URL pattern + +### `max_concurrent_connections` + +**Type**: `int` + +**Default**: `1` + +**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting. + +**Example**: +```yaml +max_concurrent_connections: 4 +``` + +**Notes**: +- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint +- When this limit is reached, the router will route requests to other endpoints with available capacity +- Higher values allow more parallel requests but may increase memory usage + +### `api_keys` + +**Type**: `dict[str, str]` + +**Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints. + +**Example**: +```yaml +api_keys: + "http://192.168.0.50:11434": "ollama" + "https://api.openai.com/v1": "${OPENAI_KEY}" +``` + +**Environment Variables**: +- API keys can reference environment variables using `${VAR_NAME}` syntax +- The router will expand these references at startup +- Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable + +## Environment Variables + +### `NOMYO_ROUTER_CONFIG_PATH` + +**Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory. 
+ +**Example**: +```bash +export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml +``` + +### `NOMYO_ROUTER_DB_PATH` + +**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory. + +**Example**: +```bash +export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db +``` + +### API-Specific Keys + +You can set API keys directly as environment variables: + +```bash +export OPENAI_KEY=your_openai_api_key +export ANTHROPIC_KEY=your_anthropic_api_key +``` + +## Configuration Best Practices + +### Multiple Ollama Instances + +For a cluster of Ollama instances: + +```yaml +endpoints: + - http://ollama-worker1:11434 + - http://ollama-worker2:11434 + - http://ollama-worker3:11434 + +max_concurrent_connections: 2 +``` + +**Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting. + +### Mixed Endpoints + +Combining Ollama and OpenAI endpoints: + +```yaml +endpoints: + - http://localhost:11434 + - https://api.openai.com/v1 + +api_keys: + "https://api.openai.com/v1": "${OPENAI_KEY}" +``` + +**Note**: The router will automatically route requests based on model availability across all endpoints. + +### High Availability + +For production deployments: + +```yaml +endpoints: + - http://ollama-primary:11434 + - http://ollama-secondary:11434 + - http://ollama-tertiary:11434 + +max_concurrent_connections: 3 +``` + +**Recommendation**: Use multiple endpoints for redundancy and load distribution. + +## Configuration Validation + +The router validates the configuration at startup: + +1. **Endpoint URLs**: Must be valid URLs +2. **API Keys**: Must be strings (can reference environment variables) +3. **Connection Limits**: Must be positive integers + +If the configuration is invalid, the router will exit with an error message. + +## Dynamic Configuration + +The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider: + +1. Using a configuration management system +2. Implementing a rolling restart strategy +3. Using environment variables for sensitive data + +## Example Configurations + +See the [examples](examples/) directory for ready-to-use configuration examples. diff --git a/doc/deployment.md b/doc/deployment.md new file mode 100644 index 0000000..f7220f8 --- /dev/null +++ b/doc/deployment.md @@ -0,0 +1,444 @@ +# Deployment Guide + +## Deployment Options + +NOMYO Router can be deployed in various environments depending on your requirements. + +## 1. 
Bare Metal / VM Deployment + +### Prerequisites + +- Python 3.10+ +- pip +- Virtual environment (recommended) + +### Installation + +```bash +# Clone the repository +git clone https://github.com/nomyo-ai/nomyo-router.git +cd nomyo-router + +# Create virtual environment +python3 -m venv .venv/router +source .venv/router/bin/activate + +# Install dependencies +pip3 install -r requirements.txt + +# Configure endpoints +nano config.yaml +``` + +### Running the Router + +```bash +# Basic startup +uvicorn router:app --host 0.0.0.0 --port 12434 + +# With custom configuration path +export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml +uvicorn router:app --host 0.0.0.0 --port 12434 + +# With custom database path +export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db +uvicorn router:app --host 0.0.0.0 --port 12434 +``` + +### Systemd Service + +Create `/etc/systemd/system/nomyo-router.service`: + +```ini +[Unit] +Description=NOMYO Router - Ollama Proxy +After=network.target + +[Service] +User=nomyo +Group=nomyo +WorkingDirectory=/opt/nomyo-router +Environment="NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml" +Environment="NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db" +ExecStart=/opt/nomyo-router/.venv/router/bin/uvicorn router:app --host 0.0.0.0 --port 12434 +Restart=always +RestartSec=10 +StandardOutput=syslog +StandardError=syslog +SyslogIdentifier=nomyo-router + +[Install] +WantedBy=multi-user.target +``` + +Enable and start the service: + +```bash +sudo systemctl daemon-reload +sudo systemctl enable nomyo-router +sudo systemctl start nomyo-router +sudo systemctl status nomyo-router +``` + +## 2. Docker Deployment + +### Build the Image + +```bash +docker build -t nomyo-router . +``` + +### Run the Container + +```bash +docker run -d \ + --name nomyo-router \ + -p 12434:12434 \ + -v /absolute/path/to/config_folder:/app/config/ \ + -e CONFIG_PATH=/app/config/config.yaml \ + nomyo-router +``` + +### Advanced Docker Configuration + +#### Custom Port + +```bash +docker run -d \ + --name nomyo-router \ + -p 9000:12434 \ + -v /path/to/config:/app/config/ \ + -e CONFIG_PATH=/app/config/config.yaml \ + nomyo-router \ + -- --port 9000 +``` + +#### Custom Host + +```bash +docker run -d \ + --name nomyo-router \ + -p 12434:12434 \ + -v /path/to/config:/app/config/ \ + -e CONFIG_PATH=/app/config/config.yaml \ + -e UVICORN_HOST=0.0.0.0 \ + nomyo-router +``` + +#### Persistent Database + +```bash +docker run -d \ + --name nomyo-router \ + -p 12434:12434 \ + -v /path/to/config:/app/config/ \ + -v /path/to/db:/app/token_counts.db \ + -e CONFIG_PATH=/app/config/config.yaml \ + -e NOMYO_ROUTER_DB_PATH=/app/token_counts.db \ + nomyo-router +``` + +### Docker Compose Example + +See [examples/docker-compose.yml](examples/docker-compose.yml) for a complete Docker Compose example. + +## 3. 
Kubernetes Deployment
+
+### Prerequisites
+
+- Kubernetes cluster
+- kubectl configured
+- Helm (optional)
+
+### Basic Deployment
+
+Create `nomyo-router-deployment.yaml`:
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: nomyo-router
+  labels:
+    app: nomyo-router
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: nomyo-router
+  template:
+    metadata:
+      labels:
+        app: nomyo-router
+    spec:
+      containers:
+      - name: nomyo-router
+        image: nomyo-router:latest
+        ports:
+        - containerPort: 12434
+        env:
+        - name: CONFIG_PATH
+          value: "/app/config/config.yaml"
+        - name: NOMYO_ROUTER_DB_PATH
+          value: "/app/token_counts.db"
+        volumeMounts:
+        - name: config-volume
+          mountPath: /app/config
+        - name: db-volume
+          mountPath: /app/token_counts.db
+          subPath: token_counts.db
+      volumes:
+      - name: config-volume
+        configMap:
+          name: nomyo-router-config
+      - name: db-volume
+        persistentVolumeClaim:
+          claimName: nomyo-router-db-pvc
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: nomyo-router
+spec:
+  selector:
+    app: nomyo-router
+  ports:
+  - protocol: TCP
+    port: 80
+    targetPort: 12434
+  type: LoadBalancer
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: nomyo-router-config
+data:
+  config.yaml: |
+    endpoints:
+      - http://ollama-service:11434
+    max_concurrent_connections: 2
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: nomyo-router-db-pvc
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 1Gi
+```
+
+Apply the deployment:
+
+```bash
+kubectl apply -f nomyo-router-deployment.yaml
+```
+
+**Note**: This example runs `replicas: 2` against a single SQLite file on a `ReadWriteOnce` claim. Two pods cannot safely share that file (and on most storage classes cannot co-mount the claim at all); for more than one replica, give each instance its own database or use an external database, as recommended in the scaling guidelines below.
+
+### Horizontal Pod Autoscaler
+
+Create `nomyo-router-hpa.yaml`:
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: nomyo-router-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: nomyo-router
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+```
+
+Apply the HPA:
+
+```bash
+kubectl apply -f nomyo-router-hpa.yaml
+```
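+
+The manifests above configure no health probes. Since the router exposes `/health`, a sketch like the following can be added under the `nomyo-router` container entry (thresholds are illustrative, not tested defaults). Per the monitoring guide, `/health` returns 503 whenever any configured backend endpoint is unhealthy, so it is used here for readiness only, with a plain TCP check for liveness:
+
+```yaml
+        livenessProbe:
+          tcpSocket:             # process-level check; does not depend on backends
+            port: 12434
+          initialDelaySeconds: 10
+          periodSeconds: 30
+        readinessProbe:
+          httpGet:
+            path: /health        # 503 while any configured backend is unhealthy
+            port: 12434
+          initialDelaySeconds: 5
+          periodSeconds: 15
+```
+
+## 4. 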
Production Deployment
+
+### High Availability Setup
+
+For production environments with multiple Ollama instances:
+
+```yaml
+# config.yaml
+endpoints:
+  - http://ollama-worker1:11434
+  - http://ollama-worker2:11434
+  - http://ollama-worker3:11434
+  - https://api.openai.com/v1
+
+max_concurrent_connections: 4
+
+api_keys:
+  "https://api.openai.com/v1": "${OPENAI_KEY}"
+```
+
+### Load Balancing
+
+Deploy multiple router instances behind a load balancer:
+
+```
+                   ┌─────────────────────────────────────┐
+                   │   Load Balancer (NGINX, Traefik)    │
+                   └─────────────────────────────────────┘
+                  │                   │                   │
+                  ▼                   ▼                   ▼
+        ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
+        │ Router Instance │ │ Router Instance │ │ Router Instance │
+        │     (Pod 1)     │ │     (Pod 2)     │ │     (Pod 3)     │
+        └─────────────────┘ └─────────────────┘ └─────────────────┘
+                  │                   │                   │
+                  ▼                   ▼                   ▼
+         ┌─────────────────────────────────────────────────────┐
+         │                   Ollama Cluster                    │
+         │  ┌─────────────┐ ┌─────────────┐ ┌───────────────┐  │
+         │  │   Ollama    │ │   Ollama    │ │  OpenAI API   │  │
+         │  │  Worker 1   │ │  Worker 2   │ │  (Fallback)   │  │
+         │  └─────────────┘ └─────────────┘ └───────────────┘  │
+         └─────────────────────────────────────────────────────┘
+```
+
+### Monitoring and Logging
+
+#### Prometheus Monitoring
+
+Create a Prometheus scrape configuration (note that the router does not yet expose a native `/metrics` endpoint — run an exporter in front of it, such as the sketch in the [Monitoring Guide](monitoring.md), and point the target at the exporter):
+
+```yaml
+scrape_configs:
+  - job_name: 'nomyo-router'
+    metrics_path: '/metrics'
+    static_configs:
+      - targets: ['nomyo-router:12434']  # replace with your exporter's host:port
+```
+
+#### Logging
+
+Configure log aggregation:
+
+```bash
+# In Docker
+docker run -d \
+  --name nomyo-router \
+  -p 12434:12434 \
+  -v /path/to/config:/app/config/ \
+  -e CONFIG_PATH=/app/config/config.yaml \
+  --log-driver=fluentd \
+  --log-opt fluentd-address=fluentd:24224 \
+  nomyo-router
+```
+
+## Deployment Checklist
+
+### Pre-Deployment
+
+- [ ] Configure all Ollama endpoints
+- [ ] Set appropriate `max_concurrent_connections`
+- [ ] Configure API keys for remote endpoints
+- [ ] Test configuration locally
+- [ ] Set up monitoring and alerting
+- [ ] Configure logging
+- [ ] Set up backup for token counts database
+
+### Post-Deployment
+
+- [ ] Verify health endpoint: `curl http://<router-host>:12434/health`
+- [ ] Check endpoint status: `curl http://<router-host>:12434/api/config`
+- [ ] Monitor connection counts: `curl http://<router-host>:12434/api/usage`
+- [ ] Set up regular backups
+- [ ] Configure auto-restart on failure
+- [ ] Monitor performance metrics
+
+## Scaling Guidelines
+
+### Vertical Scaling
+
+- Increase `max_concurrent_connections` for more parallel requests
+- Add more CPU/memory to the router instance
+- Monitor memory usage (token buffer grows with usage)
+
+### Horizontal Scaling
+
+- Deploy multiple router instances
+- Use a load balancer to distribute traffic
+- Each instance maintains its own connection tracking
+- Database can be shared or per-instance
+
+### Database Considerations
+
+- SQLite is sufficient for single-instance deployments
+- For multi-instance deployments, consider PostgreSQL
+- Regular backups are recommended
+- Database size grows with token usage history
+
+## Security Best Practices
+
+### Network Security
+
+- Use TLS for all external connections
+- Restrict access to router port (12434)
+- Use firewall rules to limit access
+- Consider using VPN for internal communications
+
+### Configuration Security
+
+- Store API keys in environment variables
+- Restrict access to config.yaml
+- Use secrets management for production deployments
+- 
Rotate API keys regularly
+
+### Runtime Security
+
+- Run as non-root user
+- Set appropriate file permissions
+- Monitor for suspicious activity
+- Keep dependencies updated
+
+## Troubleshooting Deployment Issues
+
+### Common Issues
+
+**Problem**: Router not starting
+
+- **Check**: Logs for configuration errors
+- **Solution**: Validate config.yaml syntax
+
+**Problem**: Endpoints showing as unavailable
+
+- **Check**: Network connectivity from router to endpoints
+- **Solution**: Verify firewall rules and DNS resolution
+
+**Problem**: High latency
+
+- **Check**: Endpoint health and connection counts
+- **Solution**: Add more endpoints or increase concurrency limits
+
+**Problem**: Database errors
+
+- **Check**: Database file permissions
+- **Solution**: Ensure write permissions for the database path
+
+**Problem**: Connection limits being hit
+
+- **Check**: `/api/usage` endpoint
+- **Solution**: Increase `max_concurrent_connections` or add endpoints
+
+## Examples
+
+See the [examples](examples/) directory for ready-to-use deployment examples.
diff --git a/doc/examples/docker-compose.yml b/doc/examples/docker-compose.yml
new file mode 100644
index 0000000..d940c73
--- /dev/null
+++ b/doc/examples/docker-compose.yml
@@ -0,0 +1,98 @@
+# Docker Compose example for NOMYO Router with multiple Ollama instances
+
+version: '3.8'
+
+services:
+  # NOMYO Router
+  nomyo-router:
+    image: nomyo-router:latest
+    build: .
+    ports:
+      - "12434:12434"
+    environment:
+      - CONFIG_PATH=/app/config/config.yaml
+      - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
+    volumes:
+      - ./config:/app/config
+      - router-db:/app/data  # named volumes mount as directories, so keep the DB file inside one
+    depends_on:
+      - ollama1
+      - ollama2
+      - ollama3
+    restart: unless-stopped
+    networks:
+      - nomyo-net
+
+  # Ollama Instance 1
+  ollama1:
+    image: ollama/ollama:latest
+    ports:
+      - "11434:11434"
+    volumes:
+      - ollama1-data:/root/.ollama
+    environment:
+      - OLLAMA_NUM_PARALLEL=4
+    restart: unless-stopped
+    networks:
+      - nomyo-net
+
+  # Ollama Instance 2
+  ollama2:
+    image: ollama/ollama:latest
+    ports:
+      - "11435:11434"
+    volumes:
+      - ollama2-data:/root/.ollama
+    environment:
+      - OLLAMA_NUM_PARALLEL=4
+    restart: unless-stopped
+    networks:
+      - nomyo-net
+
+  # Ollama Instance 3
+  ollama3:
+    image: ollama/ollama:latest
+    ports:
+      - "11436:11434"
+    volumes:
+      - ollama3-data:/root/.ollama
+    environment:
+      - OLLAMA_NUM_PARALLEL=4
+    restart: unless-stopped
+    networks:
+      - nomyo-net
+
+  # Optional: Prometheus for monitoring
+  prometheus:
+    image: prom/prometheus:latest
+    ports:
+      - "9090:9090"
+    volumes:
+      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+    restart: unless-stopped
+    networks:
+      - nomyo-net
+
+  # Optional: Grafana for visualization
+  grafana:
+    image: grafana/grafana:latest
+    ports:
+      - "3000:3000"
+    volumes:
+      - grafana-storage:/var/lib/grafana
+    restart: unless-stopped
+    networks:
+      - nomyo-net
+
+volumes:
+  router-db:
+  ollama1-data:
+  ollama2-data:
+  ollama3-data:
+  grafana-storage:
+
+networks:
+  nomyo-net:
+    driver: bridge
diff --git a/doc/examples/sample-config.yaml b/doc/examples/sample-config.yaml
new file mode 100644
index 0000000..9fb8d94
--- /dev/null
+++ b/doc/examples/sample-config.yaml
@@ -0,0 +1,37 @@
+# Sample NOMYO Router Configuration
+
+# Basic single endpoint configuration
+endpoints:
+  - http://localhost:11434
+
+max_concurrent_connections: 2
+
+# Multi-endpoint configuration with local Ollama instances
+# endpoints:
+# - http://ollama-worker1:11434
+# - http://ollama-worker2:11434
+# - http://ollama-worker3:11434
+
+# Mixed configuration with Ollama and OpenAI endpoints
+# endpoints:
+# - http://localhost:11434
+# - https://api.openai.com/v1
+
+
+# API keys for remote endpoints
+# Use ${VAR_NAME} syntax to reference environment variables
+api_keys:
+  # Local Ollama instances typically don't require authentication
+  "http://localhost:11434": "ollama"
+
+  # Remote Ollama instances
+  # "http://remote-ollama:11434": "ollama"
+
+  # OpenAI API
+  # "https://api.openai.com/v1": "${OPENAI_KEY}"
+
+  # Anthropic API
+  # "https://api.anthropic.com/v1": "${ANTHROPIC_KEY}"
+
+  # Other OpenAI-compatible endpoints
+  # "https://api.mistral.ai/v1": "${MISTRAL_KEY}"
diff --git a/doc/monitoring.md b/doc/monitoring.md
new file mode 100644
index 0000000..7f96def
--- /dev/null
+++ b/doc/monitoring.md
@@ -0,0 +1,515 @@
+# Monitoring and Troubleshooting Guide
+
+## Monitoring Overview
+
+NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
+
+## Monitoring Endpoints
+
+### Health Check
+
+```bash
+curl http://localhost:12434/health
+```
+
+Response:
+```json
+{
+  "status": "ok" | "error",
+  "endpoints": {
+    "http://endpoint1:11434": {
+      "status": "ok" | "error",
+      "version": "string" | "detail": "error message"
+    }
+  }
+}
+```
+
+**HTTP Status Codes**:
+- `200`: All endpoints healthy
+- `503`: One or more endpoints unhealthy
+
+### Current Usage
+
+```bash
+curl http://localhost:12434/api/usage
+```
+
+Response:
+```json
+{
+  "usage_counts": {
+    "http://endpoint1:11434": {
+      "llama3": 2,
+      "mistral": 1
+    },
+    "http://endpoint2:11434": {
+      "llama3": 0,
+      "mistral": 3
+    }
+  },
+  "token_usage_counts": {
+    "http://endpoint1:11434": {
+      "llama3": 1542,
+      "mistral": 876
+    }
+  }
+}
+```
+
+### Token Statistics
+
+```bash
+curl http://localhost:12434/api/token_counts
+```
+
+Response:
+```json
+{
+  "total_tokens": 2418,
+  "breakdown": [
+    {
+      "endpoint": "http://endpoint1:11434",
+      "model": "llama3",
+      "input_tokens": 120,
+      "output_tokens": 1422,
+      "total_tokens": 1542
+    },
+    {
+      "endpoint": "http://endpoint1:11434",
+      "model": "mistral",
+      "input_tokens": 80,
+      "output_tokens": 796,
+      "total_tokens": 876
+    }
+  ]
+}
+```
+
+### Model Statistics
+
+```bash
+curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
+```
+
+Response:
+```json
+{
+  "model": "llama3",
+  "input_tokens": 120,
+  "output_tokens": 1422,
+  "total_tokens": 1542,
+  "time_series": [
+    {
+      "endpoint": "http://endpoint1:11434",
+      "timestamp": 1712345600,
+      "input_tokens": 20,
+      "output_tokens": 150,
+      "total_tokens": 170
+    }
+  ],
+  "endpoint_distribution": {
+    "http://endpoint1:11434": 1542
+  }
+}
```
+
+### Configuration Status
+
+```bash
+curl http://localhost:12434/api/config
+```
+
+Response:
+```json
+{
+  "endpoints": [
+    {
+      "url": "http://endpoint1:11434",
+      "status": "ok" | "error",
+      "version": "string" | "detail": "error message"
+    }
+  ]
+}
+```
+
+### Real-time Usage Stream
+
+```bash
+curl http://localhost:12434/api/usage-stream
+```
+
+This provides Server-Sent Events (SSE) with real-time updates:
+```
+data: {"usage_counts": {...}, "token_usage_counts": {...}}
+
+data: {"usage_counts": {...}, "token_usage_counts": {...}}
+```
+
+## Monitoring Tools
+
+### Prometheus Integration
+
+The router does not yet expose a native Prometheus metrics endpoint (it is listed under future enhancements in the architecture guide). Instead, scrape a small exporter that translates the router's JSON stats, as sketched below:
+
+```yaml
+scrape_configs:
+  - job_name: 'nomyo-router'
+    static_configs:
+      - targets: ['nomyo-exporter:9105']  # host:port of the exporter sketched below
+```
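+
+A minimal exporter sketch follows. The `prometheus_client` and `requests` dependencies, the port 9105, and the metric name are illustrative choices, not part of the router; the `/api/usage` response shape is the one documented above:
+
+```python
+import time
+
+import requests
+from prometheus_client import Gauge, start_http_server
+
+ACTIVE = Gauge(
+    "nomyo_active_connections",
+    "Active connections per endpoint/model as reported by /api/usage",
+    ["endpoint", "model"],
+)
+
+def scrape(router_url: str = "http://localhost:12434") -> None:
+    # /api/usage returns {"usage_counts": {endpoint: {model: count}}, ...}
+    usage = requests.get(f"{router_url}/api/usage", timeout=5).json()
+    for endpoint, models in usage.get("usage_counts", {}).items():
+        for model, count in models.items():
+            ACTIVE.labels(endpoint=endpoint, model=model).set(count)
+
+if __name__ == "__main__":
+    start_http_server(9105)  # Prometheus scrapes this port
+    while True:
+        scrape()
+        time.sleep(15)
+```
+
+### Grafana Dashboard
+
+Create a dashboard 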
with these panels:
+- Endpoint health status
+- Current connection counts
+- Token usage (input/output/total)
+- Request rates
+- Response times
+- Error rates
+
+### Logging
+
+The router logs important events to stdout:
+- Configuration loading
+- Endpoint connection issues
+- Token counting operations
+- Error conditions
+
+## Troubleshooting Guide
+
+### Common Issues and Solutions
+
+#### 1. Endpoint Unavailable
+
+**Symptoms**:
+- Health check shows endpoint as "error"
+- Requests fail with connection errors
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/health
+curl http://localhost:12434/api/config
+```
+
+**Solutions**:
+- Verify Ollama endpoint is running
+- Check network connectivity
+- Verify firewall rules
+- Check DNS resolution
+- Test direct connection: `curl http://endpoint:11434/api/version`
+
+#### 2. Model Not Found
+
+**Symptoms**:
+- Error: "None of the configured endpoints advertise the model"
+- Requests fail with model not found
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/api/tags
+curl http://endpoint:11434/api/tags
+```
+
+**Solutions**:
+- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
+- Verify model name spelling
+- Check if model is available on any endpoint
+- For OpenAI endpoints, ensure model exists in their catalog
+
+#### 3. High Latency
+
+**Symptoms**:
+- Slow response times
+- Requests timing out
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/api/usage
+curl http://localhost:12434/api/config
+```
+
+**Solutions**:
+- Check if endpoints are overloaded (high connection counts)
+- Increase `max_concurrent_connections`
+- Add more endpoints to the cluster
+- Monitor Ollama endpoint performance
+- Check network latency between router and endpoints
+
+#### 4. Connection Limits Reached
+
+**Symptoms**:
+- Requests queuing
+- High connection counts
+- Slow response times
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/api/usage
+```
+
+**Solutions**:
+- Increase `max_concurrent_connections` in config.yaml
+- Add more Ollama endpoints
+- Scale your Ollama cluster
+- Use MOE system for critical queries
+
+#### 5. Token Tracking Not Working
+
+**Symptoms**:
+- Token counts not updating
+- Database errors
+
+**Diagnosis**:
+```bash
+ls -la token_counts.db
+curl http://localhost:12434/api/token_counts
+```
+
+**Solutions**:
+- Verify database file permissions
+- Check if database path is writable
+- Restart router to rebuild database
+- Check disk space
+- Verify environment variable `NOMYO_ROUTER_DB_PATH`
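+
+If those checks pass, the database itself can be inspected directly (a sketch; assumes the default `token_counts.db` path in the working directory):
+
+```bash
+# Confirm the file is a healthy SQLite database
+sqlite3 token_counts.db "PRAGMA integrity_check;"
+
+# See whether counts are actually being flushed (every 10 seconds by default)
+sqlite3 token_counts.db "SELECT COUNT(*) FROM token_counts;"
+```
+
+#### 6. 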
Streaming Issues + +**Symptoms**: +- Incomplete responses +- Connection resets during streaming +- Timeout errors + +**Diagnosis**: +- Check router logs for errors +- Test with non-streaming requests +- Monitor connection counts + +**Solutions**: +- Increase timeout settings +- Reduce `max_concurrent_connections` +- Check network stability +- Test with smaller payloads + +### Error Messages + +#### "Failed to connect to endpoint" + +**Cause**: Network connectivity issue +**Action**: Verify endpoint is reachable, check firewall, test DNS + +#### "None of the configured endpoints advertise the model" + +**Cause**: Model not pulled on any endpoint +**Action**: Pull the model, verify model name + +#### "Timed out waiting for endpoint" + +**Cause**: Endpoint slow to respond +**Action**: Check endpoint health, increase timeouts + +#### "Invalid JSON format in request body" + +**Cause**: Malformed request +**Action**: Validate request payload, check API documentation + +#### "Missing required field 'model'" + +**Cause**: Request missing model parameter +**Action**: Add model parameter to request + +### Performance Tuning + +#### Optimizing Connection Handling + +1. **Adjust concurrency limits**: + ```yaml + max_concurrent_connections: 4 + ``` + +2. **Monitor connection usage**: + ```bash + curl http://localhost:12434/api/usage + ``` + +3. **Scale horizontally**: + - Add more Ollama endpoints + - Deploy multiple router instances + +#### Reducing Latency + +1. **Keep models loaded**: + - Use frequently accessed models + - Monitor `/api/ps` for loaded models + +2. **Optimize endpoint selection**: + - Distribute models across endpoints + - Balance load evenly + +3. **Use caching**: + - Models cache (300s TTL) + - Loaded models cache (30s TTL) + +#### Memory Management + +1. **Monitor memory usage**: + - Token buffer grows with usage + - Time-series data accumulates + +2. **Adjust flush interval**: + - Default: 10 seconds + - Can be increased for less frequent I/O + +3. **Database maintenance**: + - Regular backups + - Archive old data periodically + +### Database Management + +#### Viewing Token Data + +```bash +sqlite3 token_counts.db "SELECT * FROM token_counts;" +sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;" +``` + +#### Aggregating Old Data + +```bash +curl http://localhost:12434/api/aggregate_time_series_days \ + -X POST \ + -d '{"days": 30, "trim_old": true}' +``` + +#### Backing Up Database + +```bash +cp token_counts.db token_counts.db.backup +``` + +#### Restoring Database + +```bash +cp token_counts.db.backup token_counts.db +``` + +### Advanced Troubleshooting + +#### Debugging Endpoint Selection + +Enable debug logging to see endpoint selection decisions: +```bash +uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug +``` + +#### Testing Individual Endpoints + +```bash +# Test endpoint directly +curl http://endpoint:11434/api/version + +# Test model availability +curl http://endpoint:11434/api/tags + +# Test model loading +curl http://endpoint:11434/api/ps +``` + +#### Network Diagnostics + +```bash +# Test connectivity +nc -zv endpoint 11434 + +# Test DNS resolution +dig endpoint + +# Test latency +ping endpoint +``` + +### Common Pitfalls + +1. **Using localhost in Docker**: + - Inside Docker, `localhost` refers to the container itself + - Use `host.docker.internal` or Docker service names + +2. **Incorrect model names**: + - Ollama: `llama3:latest` + - OpenAI: `gpt-4` (no version suffix) + +3. 
**Missing API keys**: + - Remote endpoints require authentication + - Set keys in config.yaml or environment variables + +4. **Firewall blocking**: + - Ensure port 11434 is open for Ollama + - Ensure port 12434 is open for router + +5. **Insufficient resources**: + - Monitor CPU/memory on Ollama endpoints + - Adjust `max_concurrent_connections` accordingly + +## Best Practices + +### Monitoring Setup + +1. **Set up health checks**: + - Monitor `/health` endpoint + - Alert on status "error" + +2. **Track usage metrics**: + - Monitor connection counts + - Track token usage + - Watch for connection limits + +3. **Log important events**: + - Configuration changes + - Endpoint failures + - Recovery events + +4. **Regular backups**: + - Backup token_counts.db + - Schedule regular backups + - Test restore procedure + +### Performance Monitoring + +1. **Baseline metrics**: + - Establish normal usage patterns + - Track trends over time + +2. **Alert thresholds**: + - Set alerts for high connection counts + - Monitor error rates + - Watch for latency spikes + +3. **Capacity planning**: + - Track growth in token usage + - Plan for scaling needs + - Monitor resource utilization + +### Incident Response + +1. **Quick diagnosis**: + - Check health endpoint first + - Review recent logs + - Verify endpoint status + +2. **Isolation**: + - Identify affected endpoints + - Isolate problematic components + - Fallback to healthy endpoints + +3. **Recovery**: + - Restart router if needed + - Rebalance load + - Restore from backup if necessary + +## Examples + +See the [examples](examples/) directory for monitoring configuration examples. diff --git a/doc/usage.md b/doc/usage.md new file mode 100644 index 0000000..2e5cda2 --- /dev/null +++ b/doc/usage.md @@ -0,0 +1,348 @@ +# Usage Guide + +## Quick Start + +### 1. Install the Router + +```bash +git clone https://github.com/nomyo-ai/nomyo-router.git +cd nomyo-router +python3 -m venv .venv/router +source .venv/router/bin/activate +pip3 install -r requirements.txt +``` + +### 2. Configure Endpoints + +Edit `config.yaml`: + +```yaml +endpoints: + - http://localhost:11434 + +max_concurrent_connections: 2 +``` + +### 3. Run the Router + +```bash +uvicorn router:app --host 0.0.0.0 --port 12434 +``` + +### 4. Use the Router + +Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance. 
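+
+Because the router mirrors the Ollama API, a quick smoke test is enough to confirm the switch (assuming the default host and port):
+
+```bash
+# Version info, answered through the router
+curl http://localhost:12434/api/version
+
+# The model list as seen through the router
+curl http://localhost:12434/api/tags
+```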
+ +## API Endpoints + +### Ollama-Compatible Endpoints + +The router provides all standard Ollama API endpoints: + +| Endpoint | Method | Description | +| --------------- | ------ | --------------------- | +| `/api/generate` | POST | Generate text | +| `/api/chat` | POST | Chat completions | +| `/api/embed` | POST | Embeddings | +| `/api/tags` | GET | List available models | +| `/api/ps` | GET | List loaded models | +| `/api/show` | POST | Show model details | +| `/api/pull` | POST | Pull a model | +| `/api/push` | POST | Push a model | +| `/api/create` | POST | Create a model | +| `/api/copy` | POST | Copy a model | +| `/api/delete` | DELETE | Delete a model | + +### OpenAI-Compatible Endpoints + +For OpenAI API compatibility: + +| Endpoint | Method | Description | +| ---------------------- | ------ | ---------------- | +| `/v1/chat/completions` | POST | Chat completions | +| `/v1/completions` | POST | Text completions | +| `/v1/embeddings` | POST | Embeddings | +| `/v1/models` | GET | List models | + +### Monitoring Endpoints + +| Endpoint | Method | Description | +| ---------------------------------- | ------ | ---------------------------------------- | +| `/api/usage` | GET | Current connection counts | +| `/api/token_counts` | GET | Token usage statistics | +| `/api/stats` | POST | Detailed model statistics | +| `/api/aggregate_time_series_days` | POST | Aggregate time series data into daily | +| `/api/version` | GET | Ollama version info | +| `/api/config` | GET | Endpoint configuration | +| `/api/usage-stream` | GET | Real-time usage updates (SSE) | +| `/health` | GET | Health check | + +## Making Requests + +### Basic Chat Request + +```bash +curl http://localhost:12434/api/chat \ + -H "Content-Type: application/json" \ + -d '{ + "model": "llama3:latest", + "messages": [ + {"role": "user", "content": "Hello, how are you?"} + ], + "stream": false + }' +``` + +### Streaming Response + +```bash +curl http://localhost:12434/api/chat \ + -H "Content-Type: application/json" \ + -d '{ + "model": "llama3:latest", + "messages": [ + {"role": "user", "content": "Tell me a story"} + ], + "stream": true + }' +``` + +### OpenAI API Format + +```bash +curl http://localhost:12434/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gpt-4o-nano", + "messages": [ + {"role": "user", "content": "Hello"} + ] + }' +``` + +## Advanced Features + +### Multiple Opinions Ensemble (MOE) + +Prefix your model name with `moe-` to enable the MOE system: + +```bash +curl http://localhost:12434/api/chat \ + -H "Content-Type: application/json" \ + -d '{ + "model": "moe-llama3:latest", + "messages": [ + {"role": "user", "content": "Explain quantum computing"} + ] + }' +``` + +The MOE system: + +1. Generates 3 responses from different endpoints +2. Generates 3 critiques of those responses +3. Selects the best response +4. 
Generates a final refined response + +### Token Tracking + +The router automatically tracks token usage: + +```bash +curl http://localhost:12434/api/token_counts +``` + +Response: + +```json +{ + "total_tokens": 1542, + "breakdown": [ + { + "endpoint": "http://localhost:11434", + "model": "llama3", + "input_tokens": 120, + "output_tokens": 1422, + "total_tokens": 1542 + } + ] +} +``` + +### Real-time Monitoring + +Use Server-Sent Events to monitor usage in real-time: + +```bash +curl http://localhost:12434/api/usage-stream +``` + +## Integration Examples + +### Python Client + +```python +import requests + +url = "http://localhost:12434/api/chat" +data = { + "model": "llama3", + "messages": [{"role": "user", "content": "What is AI?"}], + "stream": False +} + +response = requests.post(url, json=data) +print(response.json()) +``` + +### JavaScript Client + +```javascript +const response = await fetch('http://localhost:12434/api/chat', { + method: 'POST', + headers: { + 'Content-Type': 'application/json', + }, + body: JSON.stringify({ + model: 'llama3', + messages: [{ role: 'user', content: 'Hello!' }], + stream: false + }) +}); + +const data = await response.json(); +console.log(data); +``` + +### Streaming with JavaScript + +```javascript +const eventSource = new EventSource('http://localhost:12434/api/usage-stream'); + +eventSource.onmessage = (event) => { + const usage = JSON.parse(event.data); + console.log('Current usage:', usage); +}; +``` + +## Python Ollama Client + +```python +from ollama import Client + +# Configure the client to use the router +client = Client(host='http://localhost:12434') + +# Chat with a model +response = client.chat( + model='llama3:latest', + messages=[ + {'role': 'user', 'content': 'Explain quantum computing'} + ] +) +print(response['message']['content']) + +# Generate text +response = client.generate( + model='llama3:latest', + prompt='Write a short poem about AI' +) +print(response['response']) + +# List available models +models = client.list()['models'] +print(f"Available models: {[m['name'] for m in models]}") +``` + +### Python OpenAI Client + +```python +from openai import OpenAI + +# Configure the client to use the router +client = OpenAI( + base_url='http://localhost:12434/v1', + api_key='not-needed' # API key is not required for local usage +) + +# Chat completions +response = client.chat.completions.create( + model='gpt-4o-nano', + messages=[ + {'role': 'user', 'content': 'What is the meaning of life?'} + ] +) +print(response.choices[0].message.content) + +# Text completions +response = client.completions.create( + model='llama3:latest', + prompt='Once upon a time' +) +print(response.choices[0].text) + +# Embeddings +response = client.embeddings.create( + model='llama3:latest', + input='The quick brown fox jumps over the lazy dog' +) +print(f"Embedding length: {len(response.data[0].embedding)}") + +# List models +response = client.models.list() +print(f"Available models: {[m.id for m in response.data]}") +``` + +## Best Practices + +### 1. Model Selection + +- Use the same model name across all endpoints +- For Ollama, append `:latest` or a specific version tag +- For OpenAI endpoints, use the model name without version + +### 2. Connection Management + +- Set `max_concurrent_connections` appropriately for your hardware +- Monitor `/api/usage` to ensure endpoints aren't overloaded +- Consider using the MOE system for critical queries + +### 3. 
Error Handling + +- Check the `/health` endpoint regularly +- Implement retry logic with exponential backoff +- Monitor error rates and connection failures + +### 4. Performance + +- Keep frequently used models loaded on multiple endpoints +- Use streaming for large responses +- Monitor token usage to optimize costs + +## Troubleshooting + +### Common Issues + +**Problem**: Model not found + +- **Solution**: Ensure the model is pulled on at least one endpoint +- **Check**: `curl http://localhost:12434/api/tags` + +**Problem**: Connection refused + +- **Solution**: Verify Ollama endpoints are running and accessible +- **Check**: `curl http://localhost:12434/health` + +**Problem**: High latency + +- **Solution**: Check endpoint health and connection counts +- **Check**: `curl http://localhost:12434/api/usage` + +**Problem**: Token tracking not working + +- **Solution**: Ensure database path is writable +- **Check**: `ls -la token_counts.db` + +## Examples + +See the [examples](examples/) directory for complete integration examples. diff --git a/router.py b/router.py index fb3a0fb..cb5af1f 100644 --- a/router.py +++ b/router.py @@ -9,6 +9,9 @@ license: AGPL import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance from datetime import datetime, timezone from pathlib import Path + +# Directory containing static files (relative to this script) +STATIC_DIR = Path(__file__).parent / "static" from typing import Dict, Set, List, Optional from urllib.parse import urlparse from fastapi import FastAPI, Request, HTTPException @@ -55,6 +58,8 @@ flush_task: asyncio.Task | None = None token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0))) # Time series buffer with timestamp time_series_buffer: list[dict[str, int | str]] = [] +# Lock to protect buffer access from race conditions +buffer_lock = asyncio.Lock() # Configuration for periodic flushing FLUSH_INTERVAL = 10 # seconds @@ -218,45 +223,112 @@ def is_ext_openai_endpoint(endpoint: str) -> bool: return True # It's an external OpenAI endpoint async def token_worker() -> None: - while True: - endpoint, model, prompt, comp = await token_queue.get() - # Accumulate counts in memory buffer - token_buffer[endpoint][model] = ( - token_buffer[endpoint].get(model, (0, 0))[0] + prompt, - token_buffer[endpoint].get(model, (0, 0))[1] + comp - ) + try: + while True: + endpoint, model, prompt, comp = await token_queue.get() + # Accumulate counts in memory buffer (protected by lock) + async with buffer_lock: + token_buffer[endpoint][model] = ( + token_buffer[endpoint].get(model, (0, 0))[0] + prompt, + token_buffer[endpoint].get(model, (0, 0))[1] + comp + ) - # Add to time series buffer with timestamp (UTC) - now = datetime.now(tz=timezone.utc) - timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp()) - time_series_buffer.append({ - 'endpoint': endpoint, - 'model': model, - 'input_tokens': prompt, - 'output_tokens': comp, - 'total_tokens': prompt + comp, - 'timestamp': timestamp - }) + # Add to time series buffer with timestamp (UTC) + now = datetime.now(tz=timezone.utc) + timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp()) + time_series_buffer.append({ + 'endpoint': endpoint, + 'model': model, + 'input_tokens': prompt, + 'output_tokens': comp, + 'total_tokens': prompt + comp, + 'timestamp': timestamp + }) - # Update in-memory counts for immediate reporting - async 
with token_usage_lock: - token_usage_counts[endpoint][model] += (prompt + comp) - await publish_snapshot() + # Update in-memory counts for immediate reporting + async with token_usage_lock: + token_usage_counts[endpoint][model] += (prompt + comp) + await publish_snapshot() + except asyncio.CancelledError: + # Gracefully handle task cancellation during shutdown + print("[token_worker] Task cancelled, processing remaining queue items...") + # Process any remaining items in the queue before exiting + while not token_queue.empty(): + try: + endpoint, model, prompt, comp = token_queue.get_nowait() + async with buffer_lock: + token_buffer[endpoint][model] = ( + token_buffer[endpoint].get(model, (0, 0))[0] + prompt, + token_buffer[endpoint].get(model, (0, 0))[1] + comp + ) + now = datetime.now(tz=timezone.utc) + timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp()) + time_series_buffer.append({ + 'endpoint': endpoint, + 'model': model, + 'input_tokens': prompt, + 'output_tokens': comp, + 'total_tokens': prompt + comp, + 'timestamp': timestamp + }) + async with token_usage_lock: + token_usage_counts[endpoint][model] += (prompt + comp) + await publish_snapshot() + except asyncio.QueueEmpty: + break + print("[token_worker] Task cancelled, remaining items processed.") + raise async def flush_buffer() -> None: """Periodically flush accumulated token counts to the database.""" - while True: - await asyncio.sleep(FLUSH_INTERVAL) + try: + while True: + await asyncio.sleep(FLUSH_INTERVAL) - # Flush token counts - if token_buffer: - await db.update_batched_counts(token_buffer) - token_buffer.clear() + # Flush token counts and time series (protected by lock) + async with buffer_lock: + if token_buffer: + # Copy buffer before releasing lock for DB operation + buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()} + token_buffer.clear() + else: + buffer_copy = None - # Flush time series entries - if time_series_buffer: - await db.add_batched_time_series(time_series_buffer) - time_series_buffer.clear() + if time_series_buffer: + ts_copy = list(time_series_buffer) + time_series_buffer.clear() + else: + ts_copy = None + + # Perform DB operations outside the lock to avoid blocking + if buffer_copy: + await db.update_batched_counts(buffer_copy) + if ts_copy: + await db.add_batched_time_series(ts_copy) + except asyncio.CancelledError: + # Gracefully handle task cancellation during shutdown + print("[flush_buffer] Task cancelled, flushing remaining buffers...") + # Flush any remaining data before exiting + try: + async with buffer_lock: + if token_buffer: + buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()} + token_buffer.clear() + else: + buffer_copy = None + if time_series_buffer: + ts_copy = list(time_series_buffer) + time_series_buffer.clear() + else: + ts_copy = None + if buffer_copy: + await db.update_batched_counts(buffer_copy) + if ts_copy: + await db.add_batched_time_series(ts_copy) + print("[flush_buffer] Task cancelled, remaining buffers flushed.") + except Exception as e: + print(f"[flush_buffer] Error during shutdown flush: {e}") + raise async def flush_remaining_buffers() -> None: """ @@ -265,14 +337,24 @@ async def flush_remaining_buffers() -> None: """ try: flushed_entries = 0 - if token_buffer: - await db.update_batched_counts(token_buffer) - flushed_entries += sum(len(v) for v in token_buffer.values()) - token_buffer.clear() - if time_series_buffer: - await db.add_batched_time_series(time_series_buffer) - 
flushed_entries += len(time_series_buffer) - time_series_buffer.clear() + async with buffer_lock: + if token_buffer: + buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()} + flushed_entries += sum(len(v) for v in token_buffer.values()) + token_buffer.clear() + else: + buffer_copy = None + if time_series_buffer: + ts_copy = list(time_series_buffer) + flushed_entries += len(time_series_buffer) + time_series_buffer.clear() + else: + ts_copy = None + # Perform DB operations outside the lock + if buffer_copy: + await db.update_batched_counts(buffer_copy) + if ts_copy: + await db.add_batched_time_series(ts_copy) if flushed_entries: print(f"[shutdown] Flushed {flushed_entries} in-memory entries to DB on shutdown.") else: @@ -884,7 +966,9 @@ async def choose_endpoint(model: str) -> str: ] if endpoints_with_free_slot: - return random.choice(endpoints_with_free_slot) + #return random.choice(endpoints_with_free_slot) + endpoints_with_free_slot.sort(key=lambda ep: sum(usage_counts.get(ep, {}).values())) + return endpoints_with_free_slot[0] # 5️⃣ All candidate endpoints are saturated – pick one with lowest usages count (will queue) ep = min(candidate_endpoints, key=current_usage) @@ -2053,7 +2137,13 @@ async def index(request: Request): Render the dynamic NOMYO Router dashboard listing the configured endpoints and the models details, availability & task status. """ - return HTMLResponse(content=open("static/index.html", "r").read(), status_code=200) + index_path = STATIC_DIR / "index.html" + try: + return HTMLResponse(content=index_path.read_text(encoding="utf-8"), status_code=200) + except FileNotFoundError: + raise HTTPException(status_code=404, detail="Page not found") + except Exception: + raise HTTPException(status_code=500, detail="Internal server error") # ------------------------------------------------------------- # 26. Healthendpoint