Merge pull request #19 from nomyo-ai/dev-v0.5.x

feat: buffer_locks preventing race conditions in high concurrency scenarios documentation folder

Commit 6828411f95: 9 changed files with 2167 additions and 42 deletions

doc/README.md (new file, 137 lines)
# NOMYO Router Documentation

Welcome to the NOMYO Router documentation! This folder contains comprehensive guides for using, configuring, and deploying the NOMYO Router.

## Documentation Structure

```
doc/
├── architecture.md     # Technical architecture overview
├── configuration.md    # Detailed configuration guide
├── usage.md            # API usage examples
├── deployment.md       # Deployment scenarios
├── monitoring.md       # Monitoring and troubleshooting
└── examples/           # Example configurations and scripts
    ├── docker-compose.yml
    ├── sample-config.yaml
    └── k8s-deployment.yaml
```

## Getting Started

### Quick Start Guide

1. **Install the router**:

   ```bash
   git clone https://github.com/nomyo-ai/nomyo-router.git
   cd nomyo-router
   python3 -m venv .venv/router
   source .venv/router/bin/activate
   pip3 install -r requirements.txt
   ```

2. **Configure endpoints** in `config.yaml`:

   ```yaml
   endpoints:
     - http://localhost:11434
   max_concurrent_connections: 2
   ```

3. **Run the router**:

   ```bash
   uvicorn router:app --host 0.0.0.0 --port 12434
   ```

4. **Use the router**: Point your frontend to `http://localhost:12434` instead of your Ollama instance.
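
Because the router is a transparent proxy, an Ollama-style chat call simply targets the router's port instead of Ollama's. A minimal sketch using Python's standard library (the model name is illustrative, and the request itself is commented out so the snippet reads without a running router):

```python
import json
import urllib.request

payload = {
    "model": "llama3",  # any model advertised by one of your endpoints
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:12434/api/chat",  # the router, not Ollama's 11434
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```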

### Key Features

- **Intelligent Routing**: Model deployment-aware routing with load balancing
- **Multi-Endpoint Support**: Combine Ollama and OpenAI-compatible endpoints
- **Token Tracking**: Comprehensive token usage monitoring
- **Real-time Monitoring**: Server-Sent Events for live usage updates
- **OpenAI Compatibility**: Full OpenAI API compatibility layer
- **MOE System**: Multiple Opinions Ensemble for improved responses with smaller models

## Documentation Guides

### [Architecture](architecture.md)

Learn about the router's internal architecture, routing algorithm, caching mechanisms, and advanced features like the MOE system.

### [Configuration](configuration.md)

Detailed guide on configuring the router with multiple endpoints, API keys, and environment variables.

### [Usage](usage.md)

Comprehensive API reference with examples for making requests, streaming responses, and using advanced features.

### [Deployment](deployment.md)

Step-by-step deployment guides for bare metal, Docker, Kubernetes, and production environments.

### [Monitoring](monitoring.md)

Monitoring endpoints, troubleshooting guides, performance tuning, and best practices for maintaining your router.

## Examples

The [examples](examples/) directory contains ready-to-use configuration files:

- **docker-compose.yml**: Complete Docker Compose setup with multiple Ollama instances
- **sample-config.yaml**: Example configuration with comments
- **k8s-deployment.yaml**: Kubernetes deployment manifests

## Need Help?

### Common Issues

Check the [Monitoring Guide](monitoring.md) for troubleshooting common problems:

- Endpoint unavailable
- Model not found
- High latency
- Connection limits reached
- Token tracking issues

### Support

For additional help:

1. Check the [GitHub Issues](https://github.com/nomyo-ai/nomyo-router/issues)
2. Review the [Monitoring Guide](monitoring.md) for diagnostics
3. Examine the router logs for detailed error messages

## Best Practices

### Configuration

- Use environment variables for API keys
- Set an appropriate `max_concurrent_connections` value for your hardware
- Monitor endpoint health regularly
- Keep models loaded on multiple endpoints for redundancy

### Deployment

- Use Docker for containerized deployments
- Consider Kubernetes for production at scale
- Set up monitoring and alerting
- Implement regular backups of the token counts database

### Performance

- Balance load across multiple endpoints
- Keep frequently used models loaded
- Monitor connection counts and token usage
- Scale horizontally when needed

## Next Steps

1. **Read the [Architecture Guide](architecture.md)** to understand how the router works
2. **Configure your endpoints** in `config.yaml`
3. **Deploy the router** using your preferred method
4. **Monitor your setup** using the monitoring endpoints
5. **Scale as needed** by adding more endpoints

Happy routing! 🚀

doc/architecture.md (new file, 259 lines)
# NOMYO Router Architecture

## Overview

NOMYO Router is a transparent proxy for Ollama with model deployment-aware routing. It sits between your frontend application and your Ollama backend(s), providing intelligent request routing based on model availability and load balancing.

## Core Components

### 1. Request Routing Engine

The router's core intelligence is in the `choose_endpoint()` function, which implements the following routing algorithm:

```python
async def choose_endpoint(model: str) -> str:
    """
    Endpoint selection algorithm:

    1. Query all endpoints for advertised models
    2. Filter endpoints that advertise the requested model
    3. Among candidates, find those with the model loaded AND free slots
    4. If none loaded with free slots, pick any with free slots
    5. If all saturated, pick endpoint with lowest current usage
    6. If no endpoint advertises the model, raise error
    """
```
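
The six steps can be sketched as pure logic over in-memory state. This is a simplified, synchronous illustration, not the router's actual code; the names `advertised`, `loaded`, `usage`, and `limit` are assumptions standing in for its internal structures:

```python
def select_endpoint(model, advertised, loaded, usage, limit):
    """Pick an endpoint for `model` following the documented priority order."""
    # Steps 1-2: endpoints that advertise the requested model
    candidates = [ep for ep, models in advertised.items() if model in models]
    if not candidates:
        raise ValueError(f"no endpoint advertises model {model!r}")
    free = [ep for ep in candidates if usage.get((ep, model), 0) < limit]
    # Step 3: prefer endpoints with the model already loaded and a free slot
    warm = [ep for ep in free if model in loaded.get(ep, set())]
    if warm:
        return warm[0]
    # Step 4: otherwise any endpoint with a free slot (will trigger a model load)
    if free:
        return free[0]
    # Step 5: all saturated, so take the least-loaded candidate
    return min(candidates, key=lambda ep: usage.get((ep, model), 0))
```

Note how a "warm" endpoint wins over a merely free one: loading a model is far more expensive than queueing a request.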

### 2. Connection Tracking

The router maintains real-time connection counts per endpoint-model pair:

```python
usage_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
```

This allows for:

- **Load-aware routing**: Requests are routed to endpoints with available capacity
- **Model-aware routing**: Requests are routed to endpoints where the model is already loaded
- **Efficient resource utilization**: Minimizes model loading/unloading operations
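
A minimal sketch of maintaining such per-pair counters (illustrative, assuming the `defaultdict` structure above; dropping zero counts mirrors the automatic cleanup described under Memory Management below):

```python
from collections import defaultdict

usage_counts = defaultdict(lambda: defaultdict(int))

def acquire(endpoint: str, model: str) -> None:
    """Called when a request starts on an endpoint-model pair."""
    usage_counts[endpoint][model] += 1

def release(endpoint: str, model: str) -> None:
    """Called when a request finishes; zero counts are removed from tracking."""
    usage_counts[endpoint][model] -= 1
    if usage_counts[endpoint][model] <= 0:
        del usage_counts[endpoint][model]
        if not usage_counts[endpoint]:
            del usage_counts[endpoint]
```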

### 3. Caching Layer

Three types of caches improve performance:

- **Models cache** (`_models_cache`): Caches available models per endpoint (300s TTL)
- **Loaded models cache** (`_loaded_models_cache`): Caches currently loaded models (30s TTL)
- **Error cache** (`_error_cache`): Caches transient errors (10s TTL)
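
All three caches follow the same shape; a generic TTL cache can be sketched in a few lines (an illustration only, the router's actual cache implementation may differ; the injectable `clock` exists purely to make expiry testable):

```python
import time

class TTLCache:
    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self._data = {}  # key -> (value, expiry time)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expiry = entry
        if self.clock() >= expiry:  # stale entry: evict and report a miss
            del self._data[key]
            return default
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)
```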

### 4. Token Tracking System

Comprehensive token usage tracking:

```python
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
time_series_buffer: list[dict[str, int | str]] = []
```

Features:

- Real-time token counting for input/output tokens
- Periodic flushing to SQLite database (every 10 seconds)
- Time-series data for historical analysis
- Per-endpoint, per-model breakdown
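
The write-behind flow can be sketched as follows (a simplified illustration of the buffer above; `record` stands in for the per-request accounting and `flush` for the periodic task, which are not necessarily the router's actual function names):

```python
from collections import defaultdict

token_buffer = defaultdict(lambda: defaultdict(lambda: (0, 0)))

def record(endpoint: str, model: str, input_tokens: int, output_tokens: int) -> None:
    """Accumulate counts in memory; no database I/O on the request path."""
    i, o = token_buffer[endpoint][model]
    token_buffer[endpoint][model] = (i + input_tokens, o + output_tokens)

def flush() -> list[tuple[str, str, int, int, int]]:
    """Drain the buffer into rows ready for a SQLite upsert."""
    rows = [(ep, model, i, o, i + o)
            for ep, models in token_buffer.items()
            for model, (i, o) in models.items()]
    token_buffer.clear()
    return rows
```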

### 5. API Compatibility Layer

The router supports multiple API formats:

- **Ollama API**: Native `/api/generate`, `/api/chat`, `/api/embed` endpoints
- **OpenAI API**: Compatible `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings` endpoints
- **Transparent conversion**: Responses are converted between formats as needed

## Data Flow

### Request Processing

1. **Ingress**: Frontend sends request to router
2. **Endpoint Selection**: Router determines optimal endpoint
3. **Request Forwarding**: Request sent to selected Ollama endpoint
4. **Response Streaming**: Response streamed back to frontend
5. **Usage Tracking**: Connection and token counts updated
6. **Egress**: Complete response returned to frontend

### Connection Management

```mermaid
sequenceDiagram
    participant Frontend
    participant Router
    participant Endpoint1
    participant Endpoint2

    Frontend->>Router: Request for model X
    Router->>Endpoint1: Check if model X is loaded
    Router->>Endpoint2: Check if model X is loaded
    alt Endpoint1 has model X loaded
        Router->>Endpoint1: Forward request
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    else Endpoint2 has model X loaded
        Router->>Endpoint2: Forward request
        Endpoint2->>Router: Stream response
        Router->>Frontend: Stream response
    else No endpoint has model X loaded
        Router->>Endpoint1: Forward request (will trigger load)
        Endpoint1->>Router: Stream response
        Router->>Frontend: Stream response
    end
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

When the user prefixes a model name with `moe-`, the router activates the MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response based on critiques
4. Generates final refined response
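
The four stages can be illustrated with a small sketch. This is purely illustrative: the real system issues model calls to different endpoints for each stage, whereas here `generate`, `critique`, and `refine` are injected callables standing in for those calls:

```python
def moe(prompt, generate, critique, refine, n=3):
    """Multiple Opinions Ensemble: n drafts, score each, refine the best."""
    drafts = [generate(prompt, i) for i in range(n)]      # stage 1
    scores = [critique(prompt, d) for d in drafts]        # stage 2
    best = drafts[max(range(n), key=scores.__getitem__)]  # stage 3
    return refine(prompt, best)                           # stage 4
```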

### OpenAI Endpoint Support

The router can proxy requests to OpenAI-compatible endpoints alongside Ollama endpoints. It automatically:

- Detects OpenAI endpoints (those containing `/v1`)
- Converts between Ollama and OpenAI response formats
- Handles authentication with API keys
- Maintains consistent behavior across endpoint types

## Performance Considerations

### Concurrency Model

- **Max concurrent connections**: Configurable per endpoint-model pair
- **Connection pooling**: Reuses aiohttp connections
- **Async I/O**: All operations are non-blocking
- **Backpressure handling**: Queues requests when endpoints are saturated
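
Per-pair concurrency limits with queueing backpressure are commonly expressed with an `asyncio.Semaphore`; a minimal sketch under that assumption (not the router's actual code):

```python
import asyncio
from collections import defaultdict

# One semaphore per (endpoint, model) pair caps in-flight requests at 2
limits = defaultdict(lambda: asyncio.Semaphore(2))

async def forward(endpoint: str, model: str, handler):
    # Waiting on the semaphore queues the request while the pair is saturated
    async with limits[(endpoint, model)]:
        return await handler()

async def demo():
    async def fake_request():
        await asyncio.sleep(0)
        return "ok"
    return await asyncio.gather(*(forward("e1", "m", fake_request) for _ in range(5)))

results = asyncio.run(demo())
```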

### Caching Strategy

- **Short TTL for loaded models** (30s): Ensures quick detection of model loading/unloading
- **Longer TTL for available models** (300s): Reduces unnecessary API calls
- **Error caching** (10s): Prevents thundering herd during outages

### Memory Management

- **Write-behind pattern**: Token counts buffered in memory, flushed periodically
- **Queue-based SSE**: Server-Sent Events use bounded queues to prevent memory bloat
- **Automatic cleanup**: Zero connection counts are removed from tracking

## Error Handling

### Transient Errors

- Temporary connection failures are cached for 10 seconds
- During the cache period, the endpoint is treated as unavailable
- After the cache expires, the endpoint is re-tested

### Permanent Errors

- Invalid model names result in clear error messages
- Missing required fields return 400 Bad Request
- Unreachable endpoints are reported with detailed connection errors

### Health Monitoring

The `/health` endpoint provides comprehensive health status:

```json
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```

(This is a schema sketch rather than literal JSON: `|` separates the possible values, and on error a `detail` field replaces `version`.)

## Database Schema

The router uses SQLite for persistent storage:

```sql
CREATE TABLE token_counts (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model)
);

CREATE TABLE time_series (
    endpoint TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (endpoint, model, timestamp)
);
```
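
A flush into `token_counts` can use an upsert that accumulates into any existing row. A sketch with Python's `sqlite3` and an in-memory database (the actual flush task may differ; `flush_rows` is an illustrative name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE token_counts (
    endpoint TEXT NOT NULL, model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL, output_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL, PRIMARY KEY (endpoint, model))""")

def flush_rows(rows):
    # Accumulate counts into the existing (endpoint, model) row, if any
    conn.executemany("""
        INSERT INTO token_counts VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(endpoint, model) DO UPDATE SET
            input_tokens  = input_tokens  + excluded.input_tokens,
            output_tokens = output_tokens + excluded.output_tokens,
            total_tokens  = total_tokens  + excluded.total_tokens""", rows)
    conn.commit()

flush_rows([("e1", "m", 10, 5, 15)])
flush_rows([("e1", "m", 1, 2, 3)])
```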

## Scaling Considerations

### Horizontal Scaling

- Multiple router instances can run behind a load balancer
- Each instance maintains its own connection tracking
- Stateless design allows for easy scaling

### Vertical Scaling

- Connection limits can be increased via aiohttp connector settings
- Memory usage grows with the number of tracked connections
- The token buffer flushing interval can be adjusted

## Security

### Authentication

- API keys are stored in `config.yaml` (environment variables can be referenced)
- Keys are passed to endpoints via `Authorization` headers
- No authentication is required for the router itself (it can be added via middleware)

### Data Protection

- All communication uses TLS when configured
- No sensitive data is logged (except in error messages)
- The database contains only token counts and timestamps

## Monitoring and Observability

### Metrics Endpoints

- `/api/usage`: Current connection counts
- `/api/token_counts`: Aggregated token usage
- `/api/stats`: Detailed statistics per model
- `/api/config`: Endpoint configuration and status
- `/api/usage-stream`: Real-time usage updates via SSE

### Logging

- Connection errors are logged with detailed context
- Endpoint selection decisions are logged
- Token counting operations are logged at debug level

## Future Enhancements

Potential areas for improvement:

- Kubernetes operator for automatic deployment
- Prometheus metrics endpoint
- Distributed connection tracking (Redis)
- Request retry logic with exponential backoff
- Circuit breaker pattern for failing endpoints
- Rate limiting per client

doc/configuration.md (new file, 197 lines)
# Configuration Guide

## Configuration File

The NOMYO Router is configured via a YAML file (default: `config.yaml`). This file defines the Ollama endpoints, connection limits, and API keys.

### Basic Configuration

```yaml
# config.yaml
endpoints:
  - http://localhost:11434
  - http://ollama-server:11434

# Maximum concurrent connections per endpoint-model pair
max_concurrent_connections: 2
```

### Complete Example

```yaml
# config.yaml
endpoints:
  - http://192.168.0.50:11434
  - http://192.168.0.51:11434
  - http://192.168.0.52:11434
  - https://api.openai.com/v1

# Maximum concurrent connections per endpoint-model pair
# (equivalent to OLLAMA_NUM_PARALLEL)
max_concurrent_connections: 2

# API keys for remote endpoints.
# Set an environment variable such as OPENAI_KEY.
# The URLs here must match the endpoints block exactly.
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

## Configuration Options

### `endpoints`

**Type**: `list[str]`

**Description**: List of endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`).

**Examples**:

```yaml
endpoints:
  - http://localhost:11434
  - http://ollama1:11434
  - http://ollama2:11434
  - https://api.openai.com/v1
  - https://api.anthropic.com/v1
```

**Notes**:

- Ollama endpoints use the standard `/api/` prefix
- OpenAI-compatible endpoints use the `/v1` prefix
- The router automatically detects the endpoint type based on the URL pattern

### `max_concurrent_connections`

**Type**: `int`

**Default**: `1`

**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting.

**Example**:

```yaml
max_concurrent_connections: 4
```

**Notes**:

- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage

### `api_keys`

**Type**: `dict[str, str]`

**Description**: Mapping of endpoint URLs to API keys. Used for authenticating with remote endpoints.

**Example**:

```yaml
api_keys:
  "http://192.168.0.50:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

**Environment Variables**:

- API keys can reference environment variables using `${VAR_NAME}` syntax
- The router will expand these references at startup
- Example: `${OPENAI_KEY}` will be replaced with the value of the `OPENAI_KEY` environment variable
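
Such `${VAR_NAME}` expansion can be sketched with a small regex substitution (an illustration of the behavior described above, not the router's actual code; here unset variables are left as literal text):

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand(value: str, env=os.environ) -> str:
    # Replace each ${VAR} with its value; keep the literal text if unset
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)
```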

## Environment Variables

### `NOMYO_ROUTER_CONFIG_PATH`

**Description**: Path to the configuration file. If not set, defaults to `config.yaml` in the current working directory.

**Example**:

```bash
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
```

### `NOMYO_ROUTER_DB_PATH`

**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory.

**Example**:

```bash
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
```

### API-Specific Keys

You can set API keys directly as environment variables:

```bash
export OPENAI_KEY=your_openai_api_key
export ANTHROPIC_KEY=your_anthropic_api_key
```

## Configuration Best Practices

### Multiple Ollama Instances

For a cluster of Ollama instances:

```yaml
endpoints:
  - http://ollama-worker1:11434
  - http://ollama-worker2:11434
  - http://ollama-worker3:11434

max_concurrent_connections: 2
```

**Recommendation**: Set `max_concurrent_connections` to match your Ollama instances' `OLLAMA_NUM_PARALLEL` setting.

### Mixed Endpoints

Combining Ollama and OpenAI endpoints:

```yaml
endpoints:
  - http://localhost:11434
  - https://api.openai.com/v1

api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

**Note**: The router will automatically route requests based on model availability across all endpoints.

### High Availability

For production deployments:

```yaml
endpoints:
  - http://ollama-primary:11434
  - http://ollama-secondary:11434
  - http://ollama-tertiary:11434

max_concurrent_connections: 3
```

**Recommendation**: Use multiple endpoints for redundancy and load distribution.

## Configuration Validation

The router validates the configuration at startup:

1. **Endpoint URLs**: Must be valid URLs
2. **API Keys**: Must be strings (can reference environment variables)
3. **Connection Limits**: Must be positive integers

If the configuration is invalid, the router will exit with an error message.
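
A sketch of such startup validation (illustrative only; the router's actual checks and error messages may differ):

```python
from urllib.parse import urlparse

def validate_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the config is valid."""
    errors = []
    for url in config.get("endpoints", []):
        parsed = urlparse(str(url))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"invalid endpoint URL: {url!r}")
    limit = config.get("max_concurrent_connections", 1)
    if not isinstance(limit, int) or limit < 1:
        errors.append("max_concurrent_connections must be a positive integer")
    for endpoint, key in config.get("api_keys", {}).items():
        if not isinstance(key, str):
            errors.append(f"API key for {endpoint} must be a string")
    return errors
```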

## Dynamic Configuration

The configuration is loaded at startup and cannot be changed without restarting the router. For production deployments, consider:

1. Using a configuration management system
2. Implementing a rolling restart strategy
3. Using environment variables for sensitive data

## Example Configurations

See the [examples](examples/) directory for ready-to-use configuration examples.

doc/deployment.md (new file, 444 lines)
# Deployment Guide

## Deployment Options

NOMYO Router can be deployed in various environments depending on your requirements.

## 1. Bare Metal / VM Deployment

### Prerequisites

- Python 3.10+
- pip
- Virtual environment (recommended)

### Installation

```bash
# Clone the repository
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router

# Create a virtual environment
python3 -m venv .venv/router
source .venv/router/bin/activate

# Install dependencies
pip3 install -r requirements.txt

# Configure endpoints
nano config.yaml
```

### Running the Router

```bash
# Basic startup
uvicorn router:app --host 0.0.0.0 --port 12434

# With a custom configuration path
export NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml
uvicorn router:app --host 0.0.0.0 --port 12434

# With a custom database path
export NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db
uvicorn router:app --host 0.0.0.0 --port 12434
```

### Systemd Service

Create `/etc/systemd/system/nomyo-router.service`:

```ini
[Unit]
Description=NOMYO Router - Ollama Proxy
After=network.target

[Service]
User=nomyo
Group=nomyo
WorkingDirectory=/opt/nomyo-router
Environment="NOMYO_ROUTER_CONFIG_PATH=/etc/nomyo-router/config.yaml"
Environment="NOMYO_ROUTER_DB_PATH=/var/lib/nomyo-router/token_counts.db"
ExecStart=/opt/nomyo-router/.venv/router/bin/uvicorn router:app --host 0.0.0.0 --port 12434
Restart=always
RestartSec=10
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nomyo-router

[Install]
WantedBy=multi-user.target
```

Enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable nomyo-router
sudo systemctl start nomyo-router
sudo systemctl status nomyo-router
```

## 2. Docker Deployment

### Build the Image

```bash
docker build -t nomyo-router .
```

### Run the Container

```bash
docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /absolute/path/to/config_folder:/app/config/ \
  -e CONFIG_PATH=/app/config/config.yaml \
  nomyo-router
```

### Advanced Docker Configuration

#### Custom Port

```bash
docker run -d \
  --name nomyo-router \
  -p 9000:12434 \
  -v /path/to/config:/app/config/ \
  -e CONFIG_PATH=/app/config/config.yaml \
  nomyo-router \
  -- --port 9000
```

#### Custom Host

```bash
docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /path/to/config:/app/config/ \
  -e CONFIG_PATH=/app/config/config.yaml \
  -e UVICORN_HOST=0.0.0.0 \
  nomyo-router
```

#### Persistent Database

```bash
docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /path/to/config:/app/config/ \
  -v /path/to/db:/app/token_counts.db \
  -e CONFIG_PATH=/app/config/config.yaml \
  -e NOMYO_ROUTER_DB_PATH=/app/token_counts.db \
  nomyo-router
```

### Docker Compose Example

See [examples/docker-compose.yml](examples/docker-compose.yml) for a complete Docker Compose example.

## 3. Kubernetes Deployment

### Prerequisites

- Kubernetes cluster
- kubectl configured
- Helm (optional)

### Basic Deployment

Create `nomyo-router-deployment.yaml`:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nomyo-router
  labels:
    app: nomyo-router
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nomyo-router
  template:
    metadata:
      labels:
        app: nomyo-router
    spec:
      containers:
        - name: nomyo-router
          image: nomyo-router:latest
          ports:
            - containerPort: 12434
          env:
            - name: CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: NOMYO_ROUTER_DB_PATH
              value: "/app/token_counts.db"
          volumeMounts:
            - name: config-volume
              mountPath: /app/config
            - name: db-volume
              mountPath: /app/token_counts.db
              subPath: token_counts.db
      volumes:
        - name: config-volume
          configMap:
            name: nomyo-router-config
        - name: db-volume
          persistentVolumeClaim:
            claimName: nomyo-router-db-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: nomyo-router
spec:
  selector:
    app: nomyo-router
  ports:
    - protocol: TCP
      port: 80
      targetPort: 12434
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nomyo-router-config
data:
  config.yaml: |
    endpoints:
      - http://ollama-service:11434
    max_concurrent_connections: 2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nomyo-router-db-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

Apply the deployment:

```bash
kubectl apply -f nomyo-router-deployment.yaml
```

### Horizontal Pod Autoscaler

Create `nomyo-router-hpa.yaml`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nomyo-router-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nomyo-router
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Apply the HPA:

```bash
kubectl apply -f nomyo-router-hpa.yaml
```

## 4. Production Deployment

### High Availability Setup

For production environments with multiple Ollama instances:

```yaml
# config.yaml
endpoints:
  - http://ollama-worker1:11434
  - http://ollama-worker2:11434
  - http://ollama-worker3:11434
  - https://api.openai.com/v1

max_concurrent_connections: 4

api_keys:
  "https://api.openai.com/v1": "${OPENAI_KEY}"
```

### Load Balancing

Deploy multiple router instances behind a load balancer:

```
┌─────────────────────────────────────────────────────────┐
│              Load Balancer (NGINX, Traefik)             │
└─────────────────────────────────────────────────────────┘
          │                  │                  │
          ▼                  ▼                  ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Router Instance │ │ Router Instance │ │ Router Instance │
│     (Pod 1)     │ │     (Pod 2)     │ │     (Pod 3)     │
└─────────────────┘ └─────────────────┘ └─────────────────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│                     Ollama Cluster                      │
│ ┌─────────────┐  ┌─────────────┐  ┌───────────────────┐ │
│ │   Ollama    │  │   Ollama    │  │    OpenAI API     │ │
│ │  Worker 1   │  │  Worker 2   │  │    (Fallback)     │ │
│ └─────────────┘  └─────────────┘  └───────────────────┘ │
└─────────────────────────────────────────────────────────┘
```

### Monitoring and Logging

#### Prometheus Monitoring

Create a Prometheus scrape configuration:

```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nomyo-router:12434']
```

#### Logging

Configure log aggregation:

```bash
# In Docker
docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /path/to/config:/app/config/ \
  -e CONFIG_PATH=/app/config/config.yaml \
  --log-driver=fluentd \
  --log-opt fluentd-address=fluentd:24224 \
  nomyo-router
```

## Deployment Checklist

### Pre-Deployment

- [ ] Configure all Ollama endpoints
- [ ] Set appropriate `max_concurrent_connections`
- [ ] Configure API keys for remote endpoints
- [ ] Test configuration locally
- [ ] Set up monitoring and alerting
- [ ] Configure logging
- [ ] Set up backup for token counts database

### Post-Deployment

- [ ] Verify health endpoint: `curl http://<router>/health`
- [ ] Check endpoint status: `curl http://<router>/api/config`
- [ ] Monitor connection counts: `curl http://<router>/api/usage`
- [ ] Set up regular backups
- [ ] Configure auto-restart on failure
- [ ] Monitor performance metrics

## Scaling Guidelines

### Vertical Scaling

- Increase `max_concurrent_connections` for more parallel requests
- Add more CPU/memory to the router instance
- Monitor memory usage (the token buffer grows with usage)

### Horizontal Scaling

- Deploy multiple router instances
- Use a load balancer to distribute traffic
- Each instance maintains its own connection tracking
- The database can be shared or per-instance

### Database Considerations

- SQLite is sufficient for single-instance deployments
- For multi-instance deployments, consider PostgreSQL
- Regular backups are recommended
- Database size grows with token usage history

## Security Best Practices

### Network Security

- Use TLS for all external connections
- Restrict access to the router port (12434)
- Use firewall rules to limit access
- Consider using a VPN for internal communications

### Configuration Security

- Store API keys in environment variables
- Restrict access to `config.yaml`
- Use secrets management for production deployments
- Rotate API keys regularly

### Runtime Security

- Run as a non-root user
- Set appropriate file permissions
- Monitor for suspicious activity
- Keep dependencies updated

## Troubleshooting Deployment Issues

### Common Issues

**Problem**: Router not starting

- **Check**: Logs for configuration errors
- **Solution**: Validate config.yaml syntax

**Problem**: Endpoints showing as unavailable

- **Check**: Network connectivity from router to endpoints
- **Solution**: Verify firewall rules and DNS resolution

**Problem**: High latency

- **Check**: Endpoint health and connection counts
- **Solution**: Add more endpoints or increase concurrency limits

**Problem**: Database errors

- **Check**: Database file permissions
- **Solution**: Ensure write permissions for the database path

**Problem**: Connection limits being hit

- **Check**: `/api/usage` endpoint
- **Solution**: Increase `max_concurrent_connections` or add endpoints

## Examples

See the [examples](examples/) directory for ready-to-use deployment examples.

98
doc/examples/docker-compose.yml
Normal file
@ -0,0 +1,98 @@
# Docker Compose example for NOMYO Router with multiple Ollama instances

version: '3.8'

services:
  # NOMYO Router
  nomyo-router:
    image: nomyo-router:latest
    build: .
    ports:
      - "12434:12434"
    environment:
      - CONFIG_PATH=/app/config/config.yaml
      - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
    volumes:
      - ./config:/app/config
      # Mount the named volume at a directory: mounting it at a file path
      # would create a directory named token_counts.db and break SQLite.
      - router-db:/app/data
    depends_on:
      - ollama1
      - ollama2
      - ollama3
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 1
  ollama1:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama1-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 2
  ollama2:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
    volumes:
      - ollama2-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Ollama Instance 3
  ollama3:
    image: ollama/ollama:latest
    ports:
      - "11436:11434"
    volumes:
      - ollama3-data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
    restart: unless-stopped
    networks:
      - nomyo-net

  # Optional: Prometheus for monitoring
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped
    networks:
      - nomyo-net

  # Optional: Grafana for visualization
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    restart: unless-stopped
    networks:
      - nomyo-net

volumes:
  router-db:
  ollama1-data:
  ollama2-data:
  ollama3-data:
  grafana-storage:

networks:
  nomyo-net:
    driver: bridge
37
doc/examples/sample-config.yaml
Normal file
@ -0,0 +1,37 @@
# Sample NOMYO Router Configuration

# Basic single endpoint configuration
endpoints:
  - http://localhost:11434

max_concurrent_connections: 2

# Multi-endpoint configuration with local Ollama instances
# endpoints:
#   - http://ollama-worker1:11434
#   - http://ollama-worker2:11434
#   - http://ollama-worker3:11434

# Mixed configuration with Ollama and OpenAI endpoints
# endpoints:
#   - http://localhost:11434
#   - https://api.openai.com/v1

# API keys for remote endpoints
# Use ${VAR_NAME} syntax to reference environment variables
api_keys:
  # Local Ollama instances typically don't require authentication
  "http://localhost:11434": "ollama"

  # Remote Ollama instances
  # "http://remote-ollama:11434": "ollama"

  # OpenAI API
  # "https://api.openai.com/v1": "${OPENAI_KEY}"

  # Anthropic API
  # "https://api.anthropic.com/v1": "${ANTHROPIC_KEY}"

  # Other OpenAI-compatible endpoints
  # "https://api.mistral.ai/v1": "${MISTRAL_KEY}"
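The `${VAR_NAME}` references in the config above are resolved from the environment. How such expansion can work is sketched below; this is illustrative only, and the router's actual resolution logic may differ:

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with values from the environment."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), ""), value)

os.environ["OPENAI_KEY"] = "sk-example"  # hypothetical key for the demo
print(expand_env("${OPENAI_KEY}"))  # sk-example
print(expand_env("ollama"))         # ollama (plain values pass through)
```

Keeping the real keys out of config.yaml this way means the file can be committed or shared without leaking credentials.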
515
doc/monitoring.md
Normal file
@ -0,0 +1,515 @@
# Monitoring and Troubleshooting Guide

## Monitoring Overview

NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.

## Monitoring Endpoints

### Health Check

```bash
curl http://localhost:12434/health
```

Response:
```json
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```

**HTTP Status Codes**:
- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy

### Current Usage

```bash
curl http://localhost:12434/api/usage
```

Response:
```json
{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}
```
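Comparing in-flight load across endpoints just means summing the per-model counts in `usage_counts`. A small sketch based on the payload shape above:

```python
def active_connections(usage_counts: dict) -> dict[str, int]:
    """Total in-flight requests per endpoint from a /api/usage payload."""
    return {ep: sum(models.values()) for ep, models in usage_counts.items()}

usage = {
    "http://endpoint1:11434": {"llama3": 2, "mistral": 1},
    "http://endpoint2:11434": {"llama3": 0, "mistral": 3},
}
print(active_connections(usage))
# {'http://endpoint1:11434': 3, 'http://endpoint2:11434': 3}
```

Comparing these totals against `max_concurrent_connections` shows how close each endpoint is to its limit.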
### Token Statistics

```bash
curl http://localhost:12434/api/token_counts
```

Response:
```json
{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}
```
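The `breakdown` list is keyed per endpoint and model, so per-model totals across all endpoints take one small aggregation. A sketch over the payload shape above:

```python
from collections import defaultdict

def tokens_per_model(breakdown: list[dict]) -> dict[str, int]:
    """Aggregate a /api/token_counts breakdown into totals per model."""
    totals: dict[str, int] = defaultdict(int)
    for row in breakdown:
        totals[row["model"]] += row["total_tokens"]
    return dict(totals)

breakdown = [
    {"endpoint": "http://endpoint1:11434", "model": "llama3", "total_tokens": 1542},
    {"endpoint": "http://endpoint1:11434", "model": "mistral", "total_tokens": 876},
]
print(tokens_per_model(breakdown))  # {'llama3': 1542, 'mistral': 876}
```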
### Model Statistics

```bash
curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
```

Response:
```json
{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}
```

### Configuration Status

```bash
curl http://localhost:12434/api/config
```

Response:
```json
{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}
```
### Real-time Usage Stream

```bash
curl http://localhost:12434/api/usage-stream
```

This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}

data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
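Each event in the stream carries a JSON payload on a `data:` line, as shown above. Extracting the payloads from a chunk of the stream is a one-liner; a sketch (the transport itself is left to any HTTP client that can iterate lines):

```python
import json

def parse_sse_events(lines: list[str]) -> list[dict]:
    """Extract JSON payloads from the 'data: ...' lines of an SSE stream."""
    prefix = "data: "
    return [json.loads(line[len(prefix):]) for line in lines if line.startswith(prefix)]

stream = [
    'data: {"usage_counts": {}, "token_usage_counts": {}}',
    "",
    'data: {"usage_counts": {"http://endpoint1:11434": {"llama3": 1}}, "token_usage_counts": {}}',
]
events = parse_sse_events(stream)
print(len(events))  # 2
```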
## Monitoring Tools

### Prometheus Integration

Create a Prometheus scrape configuration:

```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```

### Grafana Dashboard

Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates

### Logging

The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. Endpoint Unavailable

**Symptoms**:
- Health check shows endpoint as "error"
- Requests fail with connection errors

**Diagnosis**:
```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```

**Solutions**:
- Verify the Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test direct connection: `curl http://endpoint:11434/api/version`

#### 2. Model Not Found

**Symptoms**:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found

**Diagnosis**:
```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```

**Solutions**:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if the model is available on any endpoint
- For OpenAI endpoints, ensure the model exists in their catalog

#### 3. High Latency

**Symptoms**:
- Slow response times
- Requests timing out

**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```

**Solutions**:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints

#### 4. Connection Limits Reached

**Symptoms**:
- Requests queuing
- High connection counts
- Slow response times

**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
```

**Solutions**:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use the MOE system for critical queries

#### 5. Token Tracking Not Working

**Symptoms**:
- Token counts not updating
- Database errors

**Diagnosis**:
```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```

**Solutions**:
- Verify database file permissions
- Check if the database path is writable
- Restart the router to rebuild the database
- Check disk space
- Verify the environment variable `NOMYO_ROUTER_DB_PATH`

#### 6. Streaming Issues

**Symptoms**:
- Incomplete responses
- Connection resets during streaming
- Timeout errors

**Diagnosis**:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts

**Solutions**:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads

### Error Messages

#### "Failed to connect to endpoint"

**Cause**: Network connectivity issue
**Action**: Verify the endpoint is reachable, check firewall, test DNS

#### "None of the configured endpoints advertise the model"

**Cause**: Model not pulled on any endpoint
**Action**: Pull the model, verify the model name

#### "Timed out waiting for endpoint"

**Cause**: Endpoint slow to respond
**Action**: Check endpoint health, increase timeouts

#### "Invalid JSON format in request body"

**Cause**: Malformed request
**Action**: Validate the request payload, check the API documentation

#### "Missing required field 'model'"

**Cause**: Request missing the model parameter
**Action**: Add the model parameter to the request

### Performance Tuning

#### Optimizing Connection Handling

1. **Adjust concurrency limits**:
   ```yaml
   max_concurrent_connections: 4
   ```

2. **Monitor connection usage**:
   ```bash
   curl http://localhost:12434/api/usage
   ```

3. **Scale horizontally**:
   - Add more Ollama endpoints
   - Deploy multiple router instances

#### Reducing Latency

1. **Keep models loaded**:
   - Use frequently accessed models
   - Monitor `/api/ps` for loaded models

2. **Optimize endpoint selection**:
   - Distribute models across endpoints
   - Balance load evenly

3. **Use caching**:
   - Models cache (300s TTL)
   - Loaded models cache (30s TTL)

#### Memory Management

1. **Monitor memory usage**:
   - The token buffer grows with usage
   - Time-series data accumulates

2. **Adjust flush interval**:
   - Default: 10 seconds
   - Can be increased for less frequent I/O

3. **Database maintenance**:
   - Regular backups
   - Archive old data periodically

### Database Management

#### Viewing Token Data

```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```

#### Aggregating Old Data

```bash
curl http://localhost:12434/api/aggregate_time_series_days \
  -X POST \
  -d '{"days": 30, "trim_old": true}'
```

#### Backing Up Database

```bash
cp token_counts.db token_counts.db.backup
```

#### Restoring Database

```bash
cp token_counts.db.backup token_counts.db
```
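Copying the file with `cp` while the router is mid-flush can capture an inconsistent state; SQLite's online backup API takes a consistent snapshot even while the database is in use. A sketch using the standard library (demonstrated here on a throwaway database):

```python
import os
import sqlite3
import tempfile

def backup_sqlite(src_path: str, dst_path: str) -> None:
    """Take a consistent snapshot using SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    with dst:
        src.backup(dst)  # copies the whole database in a consistent state
    dst.close()
    src.close()

# Demo on a throwaway database (table layout is illustrative, not the router's schema)
tmp = tempfile.mkdtemp()
db = os.path.join(tmp, "token_counts.db")
bak = os.path.join(tmp, "token_counts.db.backup")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE token_counts (model TEXT, total INTEGER)")
conn.execute("INSERT INTO token_counts VALUES ('llama3', 1542)")
conn.commit()
backup_sqlite(db, bak)
print(sqlite3.connect(bak).execute("SELECT total FROM token_counts").fetchone())  # (1542,)
```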
#### Advanced Troubleshooting

#### Debugging Endpoint Selection

Enable debug logging to see endpoint selection decisions:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```

#### Testing Individual Endpoints

```bash
# Test endpoint directly
curl http://endpoint:11434/api/version

# Test model availability
curl http://endpoint:11434/api/tags

# Test model loading
curl http://endpoint:11434/api/ps
```

#### Network Diagnostics

```bash
# Test connectivity
nc -zv endpoint 11434

# Test DNS resolution
dig endpoint

# Test latency
ping endpoint
```

### Common Pitfalls

1. **Using localhost in Docker**:
   - Inside Docker, `localhost` refers to the container itself
   - Use `host.docker.internal` or Docker service names

2. **Incorrect model names**:
   - Ollama: `llama3:latest`
   - OpenAI: `gpt-4` (no version suffix)

3. **Missing API keys**:
   - Remote endpoints require authentication
   - Set keys in config.yaml or environment variables

4. **Firewall blocking**:
   - Ensure port 11434 is open for Ollama
   - Ensure port 12434 is open for the router

5. **Insufficient resources**:
   - Monitor CPU/memory on Ollama endpoints
   - Adjust `max_concurrent_connections` accordingly

## Best Practices

### Monitoring Setup

1. **Set up health checks**:
   - Monitor the `/health` endpoint
   - Alert on status "error"

2. **Track usage metrics**:
   - Monitor connection counts
   - Track token usage
   - Watch for connection limits

3. **Log important events**:
   - Configuration changes
   - Endpoint failures
   - Recovery events

4. **Regular backups**:
   - Back up token_counts.db
   - Schedule regular backups
   - Test the restore procedure

### Performance Monitoring

1. **Baseline metrics**:
   - Establish normal usage patterns
   - Track trends over time

2. **Alert thresholds**:
   - Set alerts for high connection counts
   - Monitor error rates
   - Watch for latency spikes

3. **Capacity planning**:
   - Track growth in token usage
   - Plan for scaling needs
   - Monitor resource utilization

### Incident Response

1. **Quick diagnosis**:
   - Check the health endpoint first
   - Review recent logs
   - Verify endpoint status

2. **Isolation**:
   - Identify affected endpoints
   - Isolate problematic components
   - Fall back to healthy endpoints

3. **Recovery**:
   - Restart the router if needed
   - Rebalance load
   - Restore from backup if necessary

## Examples

See the [examples](examples/) directory for monitoring configuration examples.

348
doc/usage.md
Normal file
@ -0,0 +1,348 @@
# Usage Guide

## Quick Start

### 1. Install the Router

```bash
git clone https://github.com/nomyo-ai/nomyo-router.git
cd nomyo-router
python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt
```

### 2. Configure Endpoints

Edit `config.yaml`:

```yaml
endpoints:
  - http://localhost:11434

max_concurrent_connections: 2
```

### 3. Run the Router

```bash
uvicorn router:app --host 0.0.0.0 --port 12434
```

### 4. Use the Router

Configure your frontend to point to `http://localhost:12434` instead of your Ollama instance.

|
||||
## API Endpoints
|
||||
|
||||
### Ollama-Compatible Endpoints
|
||||
|
||||
The router provides all standard Ollama API endpoints:
|
||||
|
||||
| Endpoint | Method | Description |
|
||||
| --------------- | ------ | --------------------- |
|
||||
| `/api/generate` | POST | Generate text |
|
||||
| `/api/chat` | POST | Chat completions |
|
||||
| `/api/embed` | POST | Embeddings |
|
||||
| `/api/tags` | GET | List available models |
|
||||
| `/api/ps` | GET | List loaded models |
|
||||
| `/api/show` | POST | Show model details |
|
||||
| `/api/pull` | POST | Pull a model |
|
||||
| `/api/push` | POST | Push a model |
|
||||
| `/api/create` | POST | Create a model |
|
||||
| `/api/copy` | POST | Copy a model |
|
||||
| `/api/delete` | DELETE | Delete a model |
|
||||
|
||||
### OpenAI-Compatible Endpoints
|
||||
|
||||
For OpenAI API compatibility:
|
||||
|
||||
| Endpoint | Method | Description |
|
||||
| ---------------------- | ------ | ---------------- |
|
||||
| `/v1/chat/completions` | POST | Chat completions |
|
||||
| `/v1/completions` | POST | Text completions |
|
||||
| `/v1/embeddings` | POST | Embeddings |
|
||||
| `/v1/models` | GET | List models |
|
||||
|
||||
### Monitoring Endpoints
|
||||
|
||||
| Endpoint | Method | Description |
|
||||
| ---------------------------------- | ------ | ---------------------------------------- |
|
||||
| `/api/usage` | GET | Current connection counts |
|
||||
| `/api/token_counts` | GET | Token usage statistics |
|
||||
| `/api/stats` | POST | Detailed model statistics |
|
||||
| `/api/aggregate_time_series_days` | POST | Aggregate time series data into daily |
|
||||
| `/api/version` | GET | Ollama version info |
|
||||
| `/api/config` | GET | Endpoint configuration |
|
||||
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
|
||||
| `/health` | GET | Health check |
|
||||
|
||||
## Making Requests

### Basic Chat Request

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```

### Streaming Response

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:latest",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

### OpenAI API Format

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-nano",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```

## Advanced Features

### Multiple Opinions Ensemble (MOE)

Prefix your model name with `moe-` to enable the MOE system:

```bash
curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-llama3:latest",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The MOE system:

1. Generates 3 responses from different endpoints
2. Generates 3 critiques of those responses
3. Selects the best response
4. Generates a final refined response

### Token Tracking

The router automatically tracks token usage:

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 1542,
  "breakdown": [
    {
      "endpoint": "http://localhost:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    }
  ]
}
```

### Real-time Monitoring

Use Server-Sent Events to monitor usage in real-time:

```bash
curl http://localhost:12434/api/usage-stream
```

## Integration Examples

### Python Client

```python
import requests

url = "http://localhost:12434/api/chat"
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "stream": False
}

response = requests.post(url, json=data)
print(response.json())
```

### JavaScript Client

```javascript
const response = await fetch('http://localhost:12434/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
});

const data = await response.json();
console.log(data);
```

### Streaming with JavaScript

```javascript
const eventSource = new EventSource('http://localhost:12434/api/usage-stream');

eventSource.onmessage = (event) => {
  const usage = JSON.parse(event.data);
  console.log('Current usage:', usage);
};
```

### Python Ollama Client

```python
from ollama import Client

# Configure the client to use the router
client = Client(host='http://localhost:12434')

# Chat with a model
response = client.chat(
    model='llama3:latest',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
print(response['message']['content'])

# Generate text
response = client.generate(
    model='llama3:latest',
    prompt='Write a short poem about AI'
)
print(response['response'])

# List available models
models = client.list()['models']
print(f"Available models: {[m['name'] for m in models]}")
```

### Python OpenAI Client

```python
from openai import OpenAI

# Configure the client to use the router
client = OpenAI(
    base_url='http://localhost:12434/v1',
    api_key='not-needed'  # API key is not required for local usage
)

# Chat completions
response = client.chat.completions.create(
    model='gpt-4o-nano',
    messages=[
        {'role': 'user', 'content': 'What is the meaning of life?'}
    ]
)
print(response.choices[0].message.content)

# Text completions
response = client.completions.create(
    model='llama3:latest',
    prompt='Once upon a time'
)
print(response.choices[0].text)

# Embeddings
response = client.embeddings.create(
    model='llama3:latest',
    input='The quick brown fox jumps over the lazy dog'
)
print(f"Embedding length: {len(response.data[0].embedding)}")

# List models
response = client.models.list()
print(f"Available models: {[m.id for m in response.data]}")
```

## Best Practices

### 1. Model Selection

- Use the same model name across all endpoints
- For Ollama, append `:latest` or a specific version tag
- For OpenAI endpoints, use the model name without a version

### 2. Connection Management

- Set `max_concurrent_connections` appropriately for your hardware
- Monitor `/api/usage` to ensure endpoints aren't overloaded
- Consider using the MOE system for critical queries

### 3. Error Handling

- Check the `/health` endpoint regularly
- Implement retry logic with exponential backoff
- Monitor error rates and connection failures
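Retry with exponential backoff can be sketched as follows; the delays, retry count, and jitter factors here are illustrative and should be tuned to your deployment:

```python
import random
import time

def backoff_delays(retries: int = 5, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Exponentially growing delays with jitter, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
            for attempt in range(retries)]

def retry_with_backoff(call, retries: int = 5, base: float = 0.5):
    """Invoke `call` until it succeeds or the retry budget is exhausted."""
    last_error = None
    for delay in backoff_delays(retries, base=base):
        try:
            return call()
        except Exception as exc:  # in practice, catch your HTTP client's errors
            last_error = exc
            time.sleep(delay)
    raise last_error
```

Typical usage wraps the request itself, e.g. `retry_with_backoff(lambda: requests.post(url, json=data))`, so transient endpoint failures are absorbed without hammering the router.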

### 4. Performance

- Keep frequently used models loaded on multiple endpoints
- Use streaming for large responses
- Monitor token usage to optimize costs

## Troubleshooting

### Common Issues

**Problem**: Model not found

- **Solution**: Ensure the model is pulled on at least one endpoint
- **Check**: `curl http://localhost:12434/api/tags`

**Problem**: Connection refused

- **Solution**: Verify Ollama endpoints are running and accessible
- **Check**: `curl http://localhost:12434/health`

**Problem**: High latency

- **Solution**: Check endpoint health and connection counts
- **Check**: `curl http://localhost:12434/api/usage`

**Problem**: Token tracking not working

- **Solution**: Ensure the database path is writable
- **Check**: `ls -la token_counts.db`

## Examples

See the [examples](examples/) directory for complete integration examples.

174
router.py
@ -9,6 +9,9 @@ license: AGPL
import orjson, time, asyncio, yaml, ollama, openai, os, re, aiohttp, ssl, random, base64, io, enhance
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
# Directory containing static files (relative to this script)
|
||||
STATIC_DIR = Path(__file__).parent / "static"
|
||||
from typing import Dict, Set, List, Optional
|
||||
from urllib.parse import urlparse
|
||||
from fastapi import FastAPI, Request, HTTPException
|
||||
|
|
@ -55,6 +58,8 @@ flush_task: asyncio.Task | None = None
|
|||
token_buffer: dict[str, dict[str, tuple[int, int]]] = defaultdict(lambda: defaultdict(lambda: (0, 0)))
|
||||
# Time series buffer with timestamp
|
||||
time_series_buffer: list[dict[str, int | str]] = []
|
||||
# Lock to protect buffer access from race conditions
|
||||
buffer_lock = asyncio.Lock()
|
||||
|
||||
# Configuration for periodic flushing
|
||||
FLUSH_INTERVAL = 10 # seconds
|
||||
|
|
@ -218,45 +223,112 @@ def is_ext_openai_endpoint(endpoint: str) -> bool:
|
|||
return True # It's an external OpenAI endpoint
|
||||
|
||||
async def token_worker() -> None:
|
||||
while True:
|
||||
endpoint, model, prompt, comp = await token_queue.get()
|
||||
# Accumulate counts in memory buffer
|
||||
token_buffer[endpoint][model] = (
|
||||
token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
|
||||
token_buffer[endpoint].get(model, (0, 0))[1] + comp
|
||||
)
|
||||
try:
|
||||
while True:
|
||||
endpoint, model, prompt, comp = await token_queue.get()
|
||||
# Accumulate counts in memory buffer (protected by lock)
|
||||
async with buffer_lock:
|
||||
token_buffer[endpoint][model] = (
|
||||
token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
|
||||
token_buffer[endpoint].get(model, (0, 0))[1] + comp
|
||||
)
|
||||
|
||||
# Add to time series buffer with timestamp (UTC)
|
||||
now = datetime.now(tz=timezone.utc)
|
||||
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
|
||||
time_series_buffer.append({
|
||||
'endpoint': endpoint,
|
||||
'model': model,
|
||||
'input_tokens': prompt,
|
||||
'output_tokens': comp,
|
||||
'total_tokens': prompt + comp,
|
||||
'timestamp': timestamp
|
||||
})
|
||||
# Add to time series buffer with timestamp (UTC)
|
||||
now = datetime.now(tz=timezone.utc)
|
||||
timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
|
||||
time_series_buffer.append({
|
||||
'endpoint': endpoint,
|
||||
'model': model,
|
||||
'input_tokens': prompt,
|
||||
'output_tokens': comp,
|
||||
'total_tokens': prompt + comp,
|
||||
'timestamp': timestamp
|
||||
})
|
||||
|
||||
# Update in-memory counts for immediate reporting
|
||||
async with token_usage_lock:
|
||||
token_usage_counts[endpoint][model] += (prompt + comp)
|
||||
await publish_snapshot()
|
||||
# Update in-memory counts for immediate reporting
|
||||
async with token_usage_lock:
|
||||
token_usage_counts[endpoint][model] += (prompt + comp)
|
||||
await publish_snapshot()
|
||||
    except asyncio.CancelledError:
        # Gracefully handle task cancellation during shutdown
        print("[token_worker] Task cancelled, processing remaining queue items...")
        # Process any remaining items in the queue before exiting
        while not token_queue.empty():
            try:
                endpoint, model, prompt, comp = token_queue.get_nowait()
                async with buffer_lock:
                    token_buffer[endpoint][model] = (
                        token_buffer[endpoint].get(model, (0, 0))[0] + prompt,
                        token_buffer[endpoint].get(model, (0, 0))[1] + comp
                    )
                    now = datetime.now(tz=timezone.utc)
                    timestamp = int(datetime(now.year, now.month, now.day, now.hour, now.minute, tzinfo=timezone.utc).timestamp())
                    time_series_buffer.append({
                        'endpoint': endpoint,
                        'model': model,
                        'input_tokens': prompt,
                        'output_tokens': comp,
                        'total_tokens': prompt + comp,
                        'timestamp': timestamp
                    })
                async with token_usage_lock:
                    token_usage_counts[endpoint][model] += (prompt + comp)
                await publish_snapshot()
            except asyncio.QueueEmpty:
                break
        print("[token_worker] Task cancelled, remaining items processed.")
        raise

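The cancellation branch above drains the queue with `get_nowait()` before re-raising, so counts enqueued during shutdown are not lost. A minimal, self-contained sketch of that drain-on-cancel pattern (the `worker`, `queue`, and `sink` names are illustrative, not the router's own):

```python
import asyncio

async def worker(queue: asyncio.Queue, sink: list) -> None:
    try:
        while True:
            item = await queue.get()
            sink.append(item)
    except asyncio.CancelledError:
        # Drain whatever is still queued before the task dies.
        while not queue.empty():
            try:
                sink.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                break
        raise  # always re-raise so the caller sees the cancellation

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    sink: list = []
    task = asyncio.create_task(worker(queue, sink))
    await queue.put(1)
    await asyncio.sleep(0)   # let the worker consume item 1
    queue.put_nowait(2)      # enqueued around the same time as the cancel
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sink

result = asyncio.run(main())
print(result)  # → [1, 2]
```

Either the worker consumes item 2 before the cancellation lands or the drain loop picks it up afterwards; in both cases nothing queued is dropped.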
async def flush_buffer() -> None:
    """Periodically flush accumulated token counts to the database."""
    try:
        while True:
            await asyncio.sleep(FLUSH_INTERVAL)

            # Flush token counts and time series (protected by lock)
            async with buffer_lock:
                if token_buffer:
                    # Copy buffer before releasing lock for DB operation
                    buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
                    token_buffer.clear()
                else:
                    buffer_copy = None

                if time_series_buffer:
                    ts_copy = list(time_series_buffer)
                    time_series_buffer.clear()
                else:
                    ts_copy = None

            # Perform DB operations outside the lock to avoid blocking
            if buffer_copy:
                await db.update_batched_counts(buffer_copy)
            if ts_copy:
                await db.add_batched_time_series(ts_copy)
    except asyncio.CancelledError:
        # Gracefully handle task cancellation during shutdown
        print("[flush_buffer] Task cancelled, flushing remaining buffers...")
        # Flush any remaining data before exiting
        try:
            async with buffer_lock:
                if token_buffer:
                    buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
                    token_buffer.clear()
                else:
                    buffer_copy = None
                if time_series_buffer:
                    ts_copy = list(time_series_buffer)
                    time_series_buffer.clear()
                else:
                    ts_copy = None
            if buffer_copy:
                await db.update_batched_counts(buffer_copy)
            if ts_copy:
                await db.add_batched_time_series(ts_copy)
            print("[flush_buffer] Task cancelled, remaining buffers flushed.")
        except Exception as e:
            print(f"[flush_buffer] Error during shutdown flush: {e}")
        raise

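The flush path above follows a snapshot-then-write pattern: copy and clear the shared buffer while holding the lock, then perform the slow database write with the lock released so producers are never blocked on I/O. A minimal sketch under assumed names (`fake_db_write` stands in for the router's `db.update_batched_counts`; the endpoint/model data is illustrative):

```python
import asyncio

buffer_lock = asyncio.Lock()
token_buffer: dict = {
    "http://localhost:11434": {"llama3": (10, 5)},
}
flushed: list = []

async def fake_db_write(batch: dict) -> None:
    await asyncio.sleep(0)   # stands in for a real DB round-trip
    flushed.append(batch)

async def flush_once() -> None:
    # Snapshot and clear under the lock -- cheap, pure-memory work only.
    async with buffer_lock:
        if token_buffer:
            batch = {ep: dict(models) for ep, models in token_buffer.items()}
            token_buffer.clear()
        else:
            batch = None
    # The DB call happens outside the lock, so concurrent producers
    # can keep appending to the (now empty) buffer while we write.
    if batch:
        await fake_db_write(batch)

asyncio.run(flush_once())
print(flushed)  # the snapshot reached the "DB"; token_buffer is now empty
```

If the write fails, the snapshot is lost with this sketch; a hardened variant would merge `batch` back into `token_buffer` on exception.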
async def flush_remaining_buffers() -> None:
    """

@@ -265,14 +337,24 @@ async def flush_remaining_buffers() -> None:
    """
    try:
        flushed_entries = 0
        async with buffer_lock:
            if token_buffer:
                buffer_copy = {ep: dict(models) for ep, models in token_buffer.items()}
                flushed_entries += sum(len(v) for v in token_buffer.values())
                token_buffer.clear()
            else:
                buffer_copy = None
            if time_series_buffer:
                ts_copy = list(time_series_buffer)
                flushed_entries += len(time_series_buffer)
                time_series_buffer.clear()
            else:
                ts_copy = None
        # Perform DB operations outside the lock
        if buffer_copy:
            await db.update_batched_counts(buffer_copy)
        if ts_copy:
            await db.add_batched_time_series(ts_copy)
        if flushed_entries:
            print(f"[shutdown] Flushed {flushed_entries} in-memory entries to DB on shutdown.")
        else:
@@ -884,7 +966,9 @@ async def choose_endpoint(model: str) -> str:
        ]

        if endpoints_with_free_slot:
            #return random.choice(endpoints_with_free_slot)
            endpoints_with_free_slot.sort(key=lambda ep: sum(usage_counts.get(ep, {}).values()))
            return endpoints_with_free_slot[0]

        # 5️⃣ All candidate endpoints are saturated – pick one with lowest usage count (will queue)
        ep = min(candidate_endpoints, key=current_usage)
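The hunk above swaps the random pick for a least-used selection: among endpoints with a free slot, the one with the lowest accumulated token count wins. A standalone sketch of that selection, with a hypothetical `usage_counts` shape assumed to mirror the router's per-endpoint, per-model totals:

```python
# Accumulated token totals per endpoint and model (illustrative data).
usage_counts = {
    "http://gpu-a:11434": {"llama3": 120, "qwen2": 30},   # total 150
    "http://gpu-b:11434": {"llama3": 40},                 # total 40
}

def pick_endpoint(endpoints_with_free_slot: list) -> str:
    # Sort ascending by total usage across all models on the endpoint;
    # endpoints missing from usage_counts count as zero and sort first.
    endpoints_with_free_slot.sort(
        key=lambda ep: sum(usage_counts.get(ep, {}).values())
    )
    return endpoints_with_free_slot[0]

chosen = pick_endpoint(["http://gpu-a:11434", "http://gpu-b:11434"])
print(chosen)  # → http://gpu-b:11434 (40 < 150)
```

Note the in-place `sort` mutates the candidate list; a `min(..., key=...)` call, as used in the saturated-endpoints branch, would avoid that side effect.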

@@ -2053,7 +2137,13 @@ async def index(request: Request):
    Render the dynamic NOMYO Router dashboard listing the configured endpoints
    and the model details, availability & task status.
    """
    index_path = STATIC_DIR / "index.html"
    try:
        return HTMLResponse(content=index_path.read_text(encoding="utf-8"), status_code=200)
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail="Page not found")
    except Exception:
        raise HTTPException(status_code=500, detail="Internal server error")

# -------------------------------------------------------------
# 26. Health endpoint