# Monitoring and Troubleshooting Guide
## Monitoring Overview
NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
## Monitoring Endpoints
### Health Check
```bash
curl http://localhost:12434/health
```
Response:
```json
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```
**HTTP Status Codes**:
- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy
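An alerting probe can key off the 503 and name the failing endpoints directly from the body. A minimal sketch in Python (the sample payload is hypothetical, in the shape documented above):

```python
import json

def unhealthy_endpoints(status_code: int, body: str) -> list:
    """Return the endpoints reporting "error", given a /health response.

    A 200 means every endpoint is healthy; on a 503 the body says which
    endpoints failed, so an alert can name them directly.
    """
    if status_code == 200:
        return []
    health = json.loads(body)
    return [url for url, info in health.get("endpoints", {}).items()
            if info.get("status") == "error"]

# Hypothetical 503 body in the documented shape:
sample = json.dumps({
    "status": "error",
    "endpoints": {
        "http://endpoint1:11434": {"status": "ok", "version": "x.y.z"},
        "http://endpoint2:11434": {"status": "error",
                                   "detail": "connection refused"},
    },
})
print(unhealthy_endpoints(503, sample))  # ['http://endpoint2:11434']
```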
### Current Usage
```bash
curl http://localhost:12434/api/usage
```
Response:
```json
{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}
```
### Token Statistics
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}
```
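The totals in this payload are internally consistent: each entry's `total_tokens` is `input_tokens + output_tokens`, and the top-level `total_tokens` is the sum over `breakdown`. A quick consistency check on the example above:

```python
# Sanity-check a /api/token_counts payload: each entry's total should equal
# input + output, and the grand total should equal the sum of entry totals.
payload = {
    "total_tokens": 2418,
    "breakdown": [
        {"endpoint": "http://endpoint1:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542},
        {"endpoint": "http://endpoint1:11434", "model": "mistral",
         "input_tokens": 80, "output_tokens": 796, "total_tokens": 876},
    ],
}

for entry in payload["breakdown"]:
    assert entry["total_tokens"] == entry["input_tokens"] + entry["output_tokens"]
assert payload["total_tokens"] == sum(e["total_tokens"] for e in payload["breakdown"])
```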
### Model Statistics
```bash
curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
```
Response:
```json
{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}
}
```
### Configuration Status
```bash
curl http://localhost:12434/api/config
```
Response:
```json
{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}
```
### Cache Statistics
```bash
curl http://localhost:12434/api/cache/stats
```
Response when cache is enabled:
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
Response when cache is disabled:
```json
{ "enabled": false }
```
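`hit_rate` is derived from the two counters as `hits / (hits + misses)`; with the values above, 1547 / (1547 + 892) ≈ 0.634. As a small sketch:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as reported by /api/cache/stats (0.0 on a cold cache)."""
    total = hits + misses
    return round(hits / total, 3) if total else 0.0

print(hit_rate(1547, 892))  # 0.634
```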
### Cache Invalidation
```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
Clears all cached entries and resets hit/miss counters.
### Real-time Usage Stream
```bash
curl http://localhost:12434/api/usage-stream
```
This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}
data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
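A consumer only needs to pick out the `data:` lines from the stream; a minimal parser sketch (the HTTP transport itself is omitted, and the sample lines are hypothetical):

```python
import json

def parse_sse_events(stream_lines):
    """Yield decoded JSON payloads from Server-Sent Events `data:` lines,
    skipping blank keep-alive lines."""
    for line in stream_lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Hypothetical slice of the stream:
sample = [
    'data: {"usage_counts": {}, "token_usage_counts": {}}',
    '',
    'data: {"usage_counts": {"http://endpoint1:11434": {"llama3": 1}}, "token_usage_counts": {}}',
]
events = list(parse_sse_events(sample))
print(len(events))  # 2
```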
## Monitoring Tools
### Prometheus Integration
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```
### Grafana Dashboard
Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates
### Logging
The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions
## Troubleshooting Guide
### Common Issues and Solutions
#### 1. Endpoint Unavailable
**Symptoms**:
- Health check shows endpoint as "error"
- Requests fail with connection errors
**Diagnosis**:
```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```
**Solutions**:
- Verify Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test direct connection: `curl http://endpoint:11434/api/version`
#### 2. Model Not Found
**Symptoms**:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found
**Diagnosis**:
```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```
**Solutions**:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if model is available on any endpoint
- For OpenAI endpoints, ensure model exists in their catalog
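To check which endpoints actually advertise a model, compare each endpoint's `/api/tags` payload; a sketch, assuming the payload lists models under a `models` key as in Ollama's API:

```python
def endpoints_with_model(tags_by_endpoint, model):
    """Given each endpoint's /api/tags payload, return the endpoints that
    advertise `model` (Ollama names may carry a ":latest" tag suffix)."""
    wanted = model.split(":")[0]
    return [ep for ep, tags in tags_by_endpoint.items()
            if any(m["name"].split(":")[0] == wanted
                   for m in tags.get("models", []))]

# Hypothetical /api/tags payloads collected from two endpoints:
tags = {
    "http://endpoint1:11434": {"models": [{"name": "llama3:latest"}]},
    "http://endpoint2:11434": {"models": [{"name": "mistral:latest"}]},
}
print(endpoints_with_model(tags, "llama3"))  # ['http://endpoint1:11434']
```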
#### 3. High Latency
**Symptoms**:
- Slow response times
- Requests timing out
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```
**Solutions**:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints
#### 4. Connection Limits Reached
**Symptoms**:
- Requests queuing
- High connection counts
- Slow response times
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
```
**Solutions**:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use MOE system for critical queries
#### 5. Token Tracking Not Working
**Symptoms**:
- Token counts not updating
- Database errors
**Diagnosis**:
```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```
**Solutions**:
- Verify database file permissions
- Check if database path is writable
- Restart router to rebuild database
- Check disk space
- Verify environment variable `NOMYO_ROUTER_DB_PATH`
#### 6. Streaming Issues
**Symptoms**:
- Incomplete responses
- Connection resets during streaming
- Timeout errors
**Diagnosis**:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts
**Solutions**:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads
### Error Messages
#### "Failed to connect to endpoint"
- **Cause**: Network connectivity issue
- **Action**: Verify the endpoint is reachable; check firewall rules and DNS resolution
#### "None of the configured endpoints advertise the model"
- **Cause**: The model is not pulled on any endpoint
- **Action**: Pull the model and verify the model name
#### "Timed out waiting for endpoint"
- **Cause**: The endpoint is slow to respond
- **Action**: Check endpoint health and increase timeouts
#### "Invalid JSON format in request body"
- **Cause**: Malformed request body
- **Action**: Validate the request payload against the API documentation
#### "Missing required field 'model'"
- **Cause**: The request is missing the `model` parameter
- **Action**: Add a `model` parameter to the request
### Performance Tuning
#### Optimizing Connection Handling
1. **Adjust concurrency limits**:
```yaml
max_concurrent_connections: 4
```
2. **Monitor connection usage**:
```bash
curl http://localhost:12434/api/usage
```
3. **Scale horizontally**:
- Add more Ollama endpoints
- Deploy multiple router instances
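The `/api/usage` payload from step 2 can be reduced to a per-endpoint total and compared against `max_concurrent_connections`. A small sketch using the example values from the Current Usage section:

```python
def connections_per_endpoint(usage):
    """Total in-flight connections per endpoint from a /api/usage payload."""
    return {ep: sum(models.values())
            for ep, models in usage["usage_counts"].items()}

usage = {
    "usage_counts": {
        "http://endpoint1:11434": {"llama3": 2, "mistral": 1},
        "http://endpoint2:11434": {"llama3": 0, "mistral": 3},
    }
}
totals = connections_per_endpoint(usage)
print(totals)  # {'http://endpoint1:11434': 3, 'http://endpoint2:11434': 3}
```

Endpoints whose total sits persistently at the configured limit are candidates for a higher limit or for offloading to additional endpoints.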
#### Reducing Latency
1. **Keep models loaded**:
- Frequent requests keep a model resident in memory, avoiding cold-start load times
- Monitor `/api/ps` to see which models are currently loaded
2. **Optimize endpoint selection**:
- Distribute models across endpoints
- Balance load evenly
3. **Use caching**:
- Models cache (300s TTL)
- Loaded models cache (30s TTL)
#### Memory Management
1. **Monitor memory usage**:
- Token buffer grows with usage
- Time-series data accumulates
2. **Adjust flush interval**:
- Default: 10 seconds
- Can be increased for less frequent I/O
3. **Database maintenance**:
- Regular backups
- Archive old data periodically
### Database Management
#### Viewing Token Data
```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```
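The same inspection works from Python's standard library. The schema below is a hypothetical stand-in mirroring the `/api/token_counts` fields; check the actual layout first with `sqlite3 token_counts.db ".schema"`:

```python
import os
import sqlite3
import tempfile

# Hypothetical schema mirroring the /api/token_counts fields -- verify the
# real layout with `.schema` before running queries against a live database.
db_path = os.path.join(tempfile.mkdtemp(), "token_counts.db")
con = sqlite3.connect(db_path)
con.execute("""CREATE TABLE IF NOT EXISTS token_counts (
    endpoint TEXT, model TEXT,
    input_tokens INTEGER, output_tokens INTEGER)""")
con.execute("INSERT INTO token_counts VALUES (?, ?, ?, ?)",
            ("http://endpoint1:11434", "llama3", 120, 1422))
con.commit()

rows = con.execute(
    "SELECT model, input_tokens + output_tokens FROM token_counts").fetchall()
print(rows)  # [('llama3', 1542)]
con.close()
```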
#### Aggregating Old Data
```bash
curl http://localhost:12434/api/aggregate_time_series_days \
-X POST \
-d '{"days": 30, "trim_old": true}'
```
#### Backing Up Database
```bash
cp token_counts.db token_counts.db.backup
```
#### Restoring Database
```bash
cp token_counts.db.backup token_counts.db
```
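If the router is running while you back up, a plain `cp` can capture a mid-write snapshot. SQLite's online-backup API (exposed through Python's `sqlite3` module) copies a consistent snapshot instead; demonstrated here on a throwaway database:

```python
import os
import sqlite3
import tempfile

# Copying a SQLite file with `cp` while the router is writing can produce a
# torn snapshot; the online-backup API copies a consistent one instead.
tmp = tempfile.mkdtemp()
src = sqlite3.connect(os.path.join(tmp, "token_counts.db"))
src.execute("CREATE TABLE token_counts (model TEXT, total INTEGER)")
src.execute("INSERT INTO token_counts VALUES ('llama3', 1542)")
src.commit()

dst = sqlite3.connect(os.path.join(tmp, "token_counts.db.backup"))
src.backup(dst)  # consistent copy even under concurrent writes
copied = dst.execute("SELECT total FROM token_counts").fetchone()[0]
print(copied)  # 1542
src.close()
dst.close()
```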
### Advanced Troubleshooting
#### Debugging Endpoint Selection
Enable debug logging to see endpoint selection decisions:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```
#### Testing Individual Endpoints
```bash
# Test endpoint directly
curl http://endpoint:11434/api/version
# Test model availability
curl http://endpoint:11434/api/tags
# Test model loading
curl http://endpoint:11434/api/ps
```
#### Network Diagnostics
```bash
# Test connectivity
nc -zv endpoint 11434
# Test DNS resolution
dig endpoint
# Test latency
ping endpoint
```
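When `nc` or `dig` is unavailable (minimal containers, Windows hosts), the same connectivity check can be done with Python's standard library:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds; a portable
    stand-in for `nc -zv` when netcat is not installed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("endpoint", 11434) for Ollama,
#      port_open("localhost", 12434) for the router
print(port_open("127.0.0.1", 9, timeout=0.5))
```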
### Common Pitfalls
1. **Using localhost in Docker**:
- Inside Docker, `localhost` refers to the container itself
- Use `host.docker.internal` or Docker service names
2. **Incorrect model names**:
- Ollama: `llama3:latest`
- OpenAI: `gpt-4` (no version suffix)
3. **Missing API keys**:
- Remote endpoints require authentication
- Set keys in config.yaml or environment variables
4. **Firewall blocking**:
- Ensure port 11434 is open for Ollama
- Ensure port 12434 is open for router
5. **Insufficient resources**:
- Monitor CPU/memory on Ollama endpoints
- Adjust `max_concurrent_connections` accordingly
## Best Practices
### Monitoring Setup
1. **Set up health checks**:
- Monitor `/health` endpoint
- Alert on status "error"
2. **Track usage metrics**:
- Monitor connection counts
- Track token usage
- Watch for connection limits
3. **Log important events**:
- Configuration changes
- Endpoint failures
- Recovery events
4. **Regular backups**:
- Backup token_counts.db
- Schedule regular backups
- Test restore procedure
### Performance Monitoring
1. **Baseline metrics**:
- Establish normal usage patterns
- Track trends over time
2. **Alert thresholds**:
- Set alerts for high connection counts
- Monitor error rates
- Watch for latency spikes
3. **Capacity planning**:
- Track growth in token usage
- Plan for scaling needs
- Monitor resource utilization
### Incident Response
1. **Quick diagnosis**:
- Check health endpoint first
- Review recent logs
- Verify endpoint status
2. **Isolation**:
- Identify affected endpoints
- Isolate problematic components
- Fallback to healthy endpoints
3. **Recovery**:
- Restart router if needed
- Rebalance load
- Restore from backup if necessary
## Examples
See the [examples](examples/) directory for monitoring configuration examples.