Monitoring and Troubleshooting Guide
Monitoring Overview
NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
Monitoring Endpoints
Health Check
```
curl http://localhost:12434/health
```
Response:
```
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```
HTTP Status Codes:
- 200: All endpoints healthy
- 503: One or more endpoints unhealthy
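The status-code behavior can be reproduced client-side for alerting: the router reports 200 only when every endpoint is healthy. A minimal sketch (the payload below is a made-up sample in the documented shape, not a real response):

```python
import json

# Made-up /health payload in the shape documented above.
raw = json.dumps({
    "status": "error",
    "endpoints": {
        "http://endpoint1:11434": {"status": "ok", "version": "x.y.z"},
        "http://endpoint2:11434": {"status": "error", "detail": "connection refused"},
    },
})

def overall_status(payload: str) -> int:
    """Return the HTTP status matching the documented rule:
    200 only if every endpoint reports "ok", otherwise 503."""
    health = json.loads(payload)
    all_ok = all(ep.get("status") == "ok" for ep in health["endpoints"].values())
    return 200 if all_ok else 503

print(overall_status(raw))  # one endpoint is in error, so 503
```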
Current Usage
```
curl http://localhost:12434/api/usage
```
Response:
```
{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}
```
Token Statistics
```
curl http://localhost:12434/api/token_counts
```
Response:
```
{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}
```
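The response maintains a simple invariant: each row's `total_tokens` is `input_tokens + output_tokens`, and the top-level `total_tokens` is the sum over the rows. A quick sanity check over the sample values above:

```python
# The breakdown rows from the sample response above.
breakdown = [
    {"endpoint": "http://endpoint1:11434", "model": "llama3",
     "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542},
    {"endpoint": "http://endpoint1:11434", "model": "mistral",
     "input_tokens": 80, "output_tokens": 796, "total_tokens": 876},
]

# Each row's total is input + output ...
for row in breakdown:
    assert row["input_tokens"] + row["output_tokens"] == row["total_tokens"]

# ... and the grand total sums the rows.
total_tokens = sum(row["total_tokens"] for row in breakdown)
print(total_tokens)  # 2418, matching the top-level "total_tokens" field
```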
Model Statistics
```
curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
```
Response:
```
{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}
```
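`endpoint_distribution` can be read as a per-endpoint sum of token totals. The sketch below aggregates a `time_series` sample the same way; the second bucket is invented for illustration, and in the real response the distribution appears to cover all recorded history rather than just the returned window:

```python
from collections import defaultdict

# Sample time-series buckets; the second entry is invented for illustration.
time_series = [
    {"endpoint": "http://endpoint1:11434", "timestamp": 1712345600,
     "input_tokens": 20, "output_tokens": 150, "total_tokens": 170},
    {"endpoint": "http://endpoint1:11434", "timestamp": 1712349200,
     "input_tokens": 30, "output_tokens": 200, "total_tokens": 230},
]

# Sum token totals per endpoint, the same shape as endpoint_distribution.
distribution = defaultdict(int)
for bucket in time_series:
    distribution[bucket["endpoint"]] += bucket["total_tokens"]

print(dict(distribution))  # {'http://endpoint1:11434': 400}
```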
Configuration Status
```
curl http://localhost:12434/api/config
```
Response:
```
{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}
```
Cache Statistics
```
curl http://localhost:12434/api/cache/stats
```
Response when cache is enabled:
```
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
Response when cache is disabled:
```
{ "enabled": false }
```
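The `hit_rate` field is consistent with `hits / (hits + misses)`, which is worth knowing when you build alerts on it. Reproducing it with the sample numbers above:

```python
# Sample counters from the /api/cache/stats response above.
hits, misses = 1547, 892

# hit_rate as the fraction of lookups served from cache.
hit_rate = hits / (hits + misses)
print(round(hit_rate, 3))  # 0.634, matching the sample response
```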
Cache Invalidation
```
curl -X POST http://localhost:12434/api/cache/invalidate
```
Clears all cached entries and resets hit/miss counters.
Real-time Usage Stream
```
curl http://localhost:12434/api/usage-stream
```
This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}

data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
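A client consumes this stream by reading `data:` lines as they arrive. A minimal parsing sketch over a captured sample, so no live connection is needed (the payloads are invented examples in the documented shape):

```python
import json

# Two frames as they might arrive on the wire; SSE events end with a blank line.
stream = (
    'data: {"usage_counts": {"http://endpoint1:11434": {"llama3": 2}}, "token_usage_counts": {}}\n'
    "\n"
    'data: {"usage_counts": {"http://endpoint1:11434": {"llama3": 1}}, "token_usage_counts": {}}\n'
    "\n"
)

def parse_sse(text):
    """Yield the decoded JSON payload of each `data:` line."""
    for line in text.splitlines():
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

events = list(parse_sse(stream))
print(len(events))  # 2
```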
Monitoring Tools
Prometheus Integration
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```
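The `format: ['prometheus']` parameter assumes the router can emit Prometheus text format on `/api/usage`, as the scrape config implies. If a deployment only returns JSON, a small exporter can translate it into exposition format; a sketch, where the metric name `nomyo_router_connections` is illustrative and not part of the router:

```python
def usage_to_prometheus(usage_counts):
    """Render usage counts as Prometheus text exposition lines.
    The metric name is a placeholder chosen for this example."""
    lines = ["# TYPE nomyo_router_connections gauge"]
    for endpoint, models in sorted(usage_counts.items()):
        for model, count in sorted(models.items()):
            lines.append(
                f'nomyo_router_connections{{endpoint="{endpoint}",model="{model}"}} {count}'
            )
    return "\n".join(lines)

# Sample usage_counts from the /api/usage response shown earlier.
sample = {"http://endpoint1:11434": {"llama3": 2, "mistral": 1}}
print(usage_to_prometheus(sample))
```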
Grafana Dashboard
Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates
Logging
The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions
Troubleshooting Guide
Common Issues and Solutions
1. Endpoint Unavailable
Symptoms:
- Health check shows endpoint as "error"
- Requests fail with connection errors
Diagnosis:
```
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```
Solutions:
- Verify Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test a direct connection: `curl http://endpoint:11434/api/version`
2. Model Not Found
Symptoms:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found
Diagnosis:
```
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```
Solutions:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if the model is available on any endpoint
- For OpenAI endpoints, ensure the model exists in their catalog
3. High Latency
Symptoms:
- Slow response times
- Requests timing out
Diagnosis:
```
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```
Solutions:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints
4. Connection Limits Reached
Symptoms:
- Requests queuing
- High connection counts
- Slow response times
Diagnosis:
```
curl http://localhost:12434/api/usage
```
Solutions:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use MOE system for critical queries
5. Token Tracking Not Working
Symptoms:
- Token counts not updating
- Database errors
Diagnosis:
```
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```
Solutions:
- Verify database file permissions
- Check if database path is writable
- Restart router to rebuild database
- Check disk space
- Verify the `NOMYO_ROUTER_DB_PATH` environment variable
6. Streaming Issues
Symptoms:
- Incomplete responses
- Connection resets during streaming
- Timeout errors
Diagnosis:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts
Solutions:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads
Error Messages
"Failed to connect to endpoint"
Cause: Network connectivity issue
Action: Verify the endpoint is reachable, check firewall rules, test DNS
"None of the configured endpoints advertise the model"
Cause: Model not pulled on any endpoint
Action: Pull the model, verify the model name
"Timed out waiting for endpoint"
Cause: Endpoint slow to respond
Action: Check endpoint health, increase timeouts
"Invalid JSON format in request body"
Cause: Malformed request
Action: Validate the request payload, check the API documentation
"Missing required field 'model'"
Cause: Request is missing the model parameter
Action: Add the model parameter to the request
Performance Tuning
Optimizing Connection Handling
- Adjust concurrency limits: `max_concurrent_connections: 4`
- Monitor connection usage: `curl http://localhost:12434/api/usage`
- Scale horizontally:
  - Add more Ollama endpoints
  - Deploy multiple router instances
Reducing Latency
- Keep models loaded:
  - Use frequently accessed models
  - Monitor `/api/ps` for loaded models
- Optimize endpoint selection:
  - Distribute models across endpoints
  - Balance load evenly
- Use caching:
  - Models cache (300s TTL)
  - Loaded-models cache (30s TTL)
Memory Management
- Monitor memory usage:
  - Token buffer grows with usage
  - Time-series data accumulates
- Adjust the flush interval:
  - Default: 10 seconds
  - Can be increased for less frequent I/O
- Database maintenance:
  - Regular backups
  - Archive old data periodically
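The buffer-and-flush pattern behind the flush interval can be sketched as follows. `TokenBuffer` and its `sink` are illustrative stand-ins, not the router's actual classes; the point is that counts accumulate in memory and reach storage in one batched write per interval:

```python
import time

class TokenBuffer:
    """Illustrative in-memory buffer that flushes to storage at a fixed
    interval (the router's documented default is 10 seconds)."""

    def __init__(self, flush_interval=10.0, sink=print):
        self.flush_interval = flush_interval
        self.sink = sink          # stand-in for the database write
        self.pending = {}
        self.last_flush = time.monotonic()

    def add(self, model, tokens):
        """Accumulate a token count; flush if the interval has elapsed."""
        self.pending[model] = self.pending.get(model, 0) + tokens
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        """Write all pending counts in one batch and reset the clock."""
        if self.pending:
            self.sink(self.pending)   # one write instead of one per request
            self.pending = {}
        self.last_flush = time.monotonic()

# With a zero interval, every add flushes immediately:
buf = TokenBuffer(flush_interval=0.0, sink=lambda batch: print("flushing", batch))
buf.add("llama3", 170)
```

A longer interval trades durability (more counts lost on a crash) for fewer database writes, which is the tuning knob described above.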
Database Management
Viewing Token Data
```
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```
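The same queries work from Python's built-in `sqlite3` module if you prefer scripting over the shell. Note that the real token_counts.db schema is not documented here; the illustrative schema below merely mirrors the fields the API responses expose:

```python
import sqlite3

# Illustrative schema and data; the router's actual table layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts (endpoint TEXT, model TEXT, "
    "input_tokens INTEGER, output_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO token_counts VALUES (?, ?, ?, ?)",
    [("http://endpoint1:11434", "llama3", 120, 1422),
     ("http://endpoint1:11434", "mistral", 80, 796)],
)

# Per-model totals, analogous to the /api/token_counts breakdown:
rows = conn.execute(
    "SELECT model, SUM(input_tokens + output_tokens) FROM token_counts "
    "GROUP BY model ORDER BY model"
).fetchall()
print(rows)  # [('llama3', 1542), ('mistral', 876)]
```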
Aggregating Old Data
```
curl http://localhost:12434/api/aggregate_time_series_days \
  -X POST \
  -d '{"days": 30, "trim_old": true}'
```
Backing Up Database
```
cp token_counts.db token_counts.db.backup
```
Restoring Database
```
cp token_counts.db.backup token_counts.db
```
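A plain `cp` is safe while the router is stopped. For a live database, SQLite's online backup API copies a consistent snapshot even while another process is writing. A sketch using a small stand-in database in place of a live token_counts.db:

```python
import os
import sqlite3
import tempfile

# Create a small stand-in database (in place of a live token_counts.db).
src_path = os.path.join(tempfile.mkdtemp(), "token_counts.db")
src = sqlite3.connect(src_path)
src.execute("CREATE TABLE token_counts (model TEXT, total_tokens INTEGER)")
src.execute("INSERT INTO token_counts VALUES ('llama3', 1542)")
src.commit()

# Online backup: page-by-page copy that stays consistent under writes.
dst = sqlite3.connect(src_path + ".backup")
src.backup(dst)

print(dst.execute("SELECT total_tokens FROM token_counts").fetchone()[0])  # 1542
```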
Advanced Troubleshooting
Debugging Endpoint Selection
Enable debug logging to see endpoint selection decisions:
```
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```
Testing Individual Endpoints
```
# Test endpoint directly
curl http://endpoint:11434/api/version

# Test model availability
curl http://endpoint:11434/api/tags

# Test model loading
curl http://endpoint:11434/api/ps
```
Network Diagnostics
```
# Test connectivity
nc -zv endpoint 11434

# Test DNS resolution
dig endpoint

# Test latency
ping endpoint
```
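The `nc -zv` connectivity check can also be done from Python, which helps on hosts where netcat is unavailable. A sketch; `port_open` is a hypothetical helper, and the demo binds its own listener so the example is self-contained:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Equivalent of `nc -zv host port`: attempt a TCP connect and
    report whether it succeeded within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a listener we control:
listener = socket.socket()
listener.bind(("127.0.0.1", 0))       # the OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]
print(port_open("127.0.0.1", port))   # True
listener.close()
```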
Common Pitfalls
- Using localhost in Docker:
  - Inside Docker, `localhost` refers to the container itself
  - Use `host.docker.internal` or Docker service names
- Incorrect model names:
  - Ollama: `llama3:latest`
  - OpenAI: `gpt-4` (no version suffix)
- Missing API keys:
  - Remote endpoints require authentication
  - Set keys in config.yaml or environment variables
- Firewall blocking:
  - Ensure port 11434 is open for Ollama
  - Ensure port 12434 is open for the router
- Insufficient resources:
  - Monitor CPU/memory on Ollama endpoints
  - Adjust `max_concurrent_connections` accordingly
Best Practices
Monitoring Setup
- Set up health checks:
  - Monitor the `/health` endpoint
  - Alert on status "error"
- Track usage metrics:
  - Monitor connection counts
  - Track token usage
  - Watch for connection limits
- Log important events:
  - Configuration changes
  - Endpoint failures
  - Recovery events
- Regular backups:
  - Back up token_counts.db
  - Schedule regular backups
  - Test the restore procedure
Performance Monitoring
- Baseline metrics:
  - Establish normal usage patterns
  - Track trends over time
- Alert thresholds:
  - Set alerts for high connection counts
  - Monitor error rates
  - Watch for latency spikes
- Capacity planning:
  - Track growth in token usage
  - Plan for scaling needs
  - Monitor resource utilization
Incident Response
- Quick diagnosis:
  - Check the health endpoint first
  - Review recent logs
  - Verify endpoint status
- Isolation:
  - Identify affected endpoints
  - Isolate problematic components
  - Fall back to healthy endpoints
- Recovery:
  - Restart the router if needed
  - Rebalance load
  - Restore from backup if necessary
Examples
See the examples directory for monitoring configuration examples.