feat:

added buffer_lock to prevent race condition in high concurrency scenarios added documentation
2026-01-05 17:16:31 +01:00 · 2026-01-05 17:16:31 +01:00 · 20a016269d
commit 20a016269d
parent 434b6d4cca
9 changed files with 2167 additions and 42 deletions
--- a/doc/monitoring.md
+++ b/doc/monitoring.md
@ -0,0 +1,515 @@
+# Monitoring and Troubleshooting Guide
+
+## Monitoring Overview
+
+NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
+
+## Monitoring Endpoints
+
+### Health Check
+
+```bash
+curl http://localhost:12434/health
+```
+
+Response:
+```json
+{
+  "status": "ok" | "error",
+  "endpoints": {
+    "http://endpoint1:11434": {
+      "status": "ok" | "error",
+      "version": "string" | "detail": "error message"
+    }
+  }
+}
+```
+
+**HTTP Status Codes**:
+- `200`: All endpoints healthy
+- `503`: One or more endpoints unhealthy
+
+### Current Usage
+
+```bash
+curl http://localhost:12434/api/usage
+```
+
+Response:
+```json
+{
+  "usage_counts": {
+    "http://endpoint1:11434": {
+      "llama3": 2,
+      "mistral": 1
+    },
+    "http://endpoint2:11434": {
+      "llama3": 0,
+      "mistral": 3
+    }
+  },
+  "token_usage_counts": {
+    "http://endpoint1:11434": {
+      "llama3": 1542,
+      "mistral": 876
+    }
+  }
+}
+```
+
+### Token Statistics
+
+```bash
+curl http://localhost:12434/api/token_counts
+```
+
+Response:
+```json
+{
+  "total_tokens": 2418,
+  "breakdown": [
+    {
+      "endpoint": "http://endpoint1:11434",
+      "model": "llama3",
+      "input_tokens": 120,
+      "output_tokens": 1422,
+      "total_tokens": 1542
+    },
+    {
+      "endpoint": "http://endpoint1:11434",
+      "model": "mistral",
+      "input_tokens": 80,
+      "output_tokens": 796,
+      "total_tokens": 876
+    }
+  ]
+}
+```
+
+### Model Statistics
+
+```bash
+curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
+```
+
+Response:
+```json
+{
+  "model": "llama3",
+  "input_tokens": 120,
+  "output_tokens": 1422,
+  "total_tokens": 1542,
+  "time_series": [
+    {
+      "endpoint": "http://endpoint1:11434",
+      "timestamp": 1712345600,
+      "input_tokens": 20,
+      "output_tokens": 150,
+      "total_tokens": 170
+    }
+  ],
+  "endpoint_distribution": {
+    "http://endpoint1:11434": 1542
+  }
+}
+```
+
+### Configuration Status
+
+```bash
+curl http://localhost:12434/api/config
+```
+
+Response:
+```json
+{
+  "endpoints": [
+    {
+      "url": "http://endpoint1:11434",
+      "status": "ok" | "error",
+      "version": "string" | "detail": "error message"
+    }
+  ]
+}
+```
+
+### Real-time Usage Stream
+
+```bash
+curl http://localhost:12434/api/usage-stream
+```
+
+This provides Server-Sent Events (SSE) with real-time updates:
+```
+data: {"usage_counts": {...}, "token_usage_counts": {...}}
+
+data: {"usage_counts": {...}, "token_usage_counts": {...}}
+```
+
+## Monitoring Tools
+
+### Prometheus Integration
+
+Create a Prometheus scrape configuration:
+
+```yaml
+scrape_configs:
+  - job_name: 'nomyo-router'
+    metrics_path: '/api/usage'
+    params:
+      format: ['prometheus']
+    static_configs:
+      - targets: ['nomyo-router:12434']
+```
+
+### Grafana Dashboard
+
+Create a dashboard with these panels:
+- Endpoint health status
+- Current connection counts
+- Token usage (input/output/total)
+- Request rates
+- Response times
+- Error rates
+
+### Logging
+
+The router logs important events to stdout:
+- Configuration loading
+- Endpoint connection issues
+- Token counting operations
+- Error conditions
+
+## Troubleshooting Guide
+
+### Common Issues and Solutions
+
+#### 1. Endpoint Unavailable
+
+**Symptoms**:
+- Health check shows endpoint as "error"
+- Requests fail with connection errors
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/health
+curl http://localhost:12434/api/config
+```
+
+**Solutions**:
+- Verify Ollama endpoint is running
+- Check network connectivity
+- Verify firewall rules
+- Check DNS resolution
+- Test direct connection: `curl http://endpoint:11434/api/version`
+
+#### 2. Model Not Found
+
+**Symptoms**:
+- Error: "None of the configured endpoints advertise the model"
+- Requests fail with model not found
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/api/tags
+curl http://endpoint:11434/api/tags
+```
+
+**Solutions**:
+- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
+- Verify model name spelling
+- Check if model is available on any endpoint
+- For OpenAI endpoints, ensure model exists in their catalog
+
+#### 3. High Latency
+
+**Symptoms**:
+- Slow response times
+- Requests timing out
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/api/usage
+curl http://localhost:12434/api/config
+```
+
+**Solutions**:
+- Check if endpoints are overloaded (high connection counts)
+- Increase `max_concurrent_connections`
+- Add more endpoints to the cluster
+- Monitor Ollama endpoint performance
+- Check network latency between router and endpoints
+
+#### 4. Connection Limits Reached
+
+**Symptoms**:
+- Requests queuing
+- High connection counts
+- Slow response times
+
+**Diagnosis**:
+```bash
+curl http://localhost:12434/api/usage
+```
+
+**Solutions**:
+- Increase `max_concurrent_connections` in config.yaml
+- Add more Ollama endpoints
+- Scale your Ollama cluster
+- Use MOE system for critical queries
+
+#### 5. Token Tracking Not Working
+
+**Symptoms**:
+- Token counts not updating
+- Database errors
+
+**Diagnosis**:
+```bash
+ls -la token_counts.db
+curl http://localhost:12434/api/token_counts
+```
+
+**Solutions**:
+- Verify database file permissions
+- Check if database path is writable
+- Restart router to rebuild database
+- Check disk space
+- Verify environment variable `NOMYO_ROUTER_DB_PATH`
+
+#### 6. Streaming Issues
+
+**Symptoms**:
+- Incomplete responses
+- Connection resets during streaming
+- Timeout errors
+
+**Diagnosis**:
+- Check router logs for errors
+- Test with non-streaming requests
+- Monitor connection counts
+
+**Solutions**:
+- Increase timeout settings
+- Reduce `max_concurrent_connections`
+- Check network stability
+- Test with smaller payloads
+
+### Error Messages
+
+#### "Failed to connect to endpoint"
+
+**Cause**: Network connectivity issue
+**Action**: Verify endpoint is reachable, check firewall, test DNS
+
+#### "None of the configured endpoints advertise the model"
+
+**Cause**: Model not pulled on any endpoint
+**Action**: Pull the model, verify model name
+
+#### "Timed out waiting for endpoint"
+
+**Cause**: Endpoint slow to respond
+**Action**: Check endpoint health, increase timeouts
+
+#### "Invalid JSON format in request body"
+
+**Cause**: Malformed request
+**Action**: Validate request payload, check API documentation
+
+#### "Missing required field 'model'"
+
+**Cause**: Request missing model parameter
+**Action**: Add model parameter to request
+
+### Performance Tuning
+
+#### Optimizing Connection Handling
+
+1. **Adjust concurrency limits**:
+   ```yaml
+   max_concurrent_connections: 4
+   ```
+
+2. **Monitor connection usage**:
+   ```bash
+   curl http://localhost:12434/api/usage
+   ```
+
+3. **Scale horizontally**:
+   - Add more Ollama endpoints
+   - Deploy multiple router instances
+
+#### Reducing Latency
+
+1. **Keep models loaded**:
+   - Use frequently accessed models
+   - Monitor `/api/ps` for loaded models
+
+2. **Optimize endpoint selection**:
+   - Distribute models across endpoints
+   - Balance load evenly
+
+3. **Use caching**:
+   - Models cache (300s TTL)
+   - Loaded models cache (30s TTL)
+
+#### Memory Management
+
+1. **Monitor memory usage**:
+   - Token buffer grows with usage
+   - Time-series data accumulates
+
+2. **Adjust flush interval**:
+   - Default: 10 seconds
+   - Can be increased for less frequent I/O
+
+3. **Database maintenance**:
+   - Regular backups
+   - Archive old data periodically
+
+### Database Management
+
+#### Viewing Token Data
+
+```bash
+sqlite3 token_counts.db "SELECT * FROM token_counts;"
+sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
+```
+
+#### Aggregating Old Data
+
+```bash
+curl http://localhost:12434/api/aggregate_time_series_days \
+  -X POST \
+  -d '{"days": 30, "trim_old": true}'
+```
+
+#### Backing Up Database
+
+```bash
+cp token_counts.db token_counts.db.backup
+```
+
+#### Restoring Database
+
+```bash
+cp token_counts.db.backup token_counts.db
+```
+
+### Advanced Troubleshooting
+
+#### Debugging Endpoint Selection
+
+Enable debug logging to see endpoint selection decisions:
+```bash
+uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
+```
+
+#### Testing Individual Endpoints
+
+```bash
+# Test endpoint directly
+curl http://endpoint:11434/api/version
+
+# Test model availability
+curl http://endpoint:11434/api/tags
+
+# Test model loading
+curl http://endpoint:11434/api/ps
+```
+
+#### Network Diagnostics
+
+```bash
+# Test connectivity
+nc -zv endpoint 11434
+
+# Test DNS resolution
+dig endpoint
+
+# Test latency
+ping endpoint
+```
+
+### Common Pitfalls
+
+1. **Using localhost in Docker**:
+   - Inside Docker, `localhost` refers to the container itself
+   - Use `host.docker.internal` or Docker service names
+
+2. **Incorrect model names**:
+   - Ollama: `llama3:latest`
+   - OpenAI: `gpt-4` (no version suffix)
+
+3. **Missing API keys**:
+   - Remote endpoints require authentication
+   - Set keys in config.yaml or environment variables
+
+4. **Firewall blocking**:
+   - Ensure port 11434 is open for Ollama
+   - Ensure port 12434 is open for router
+
+5. **Insufficient resources**:
+   - Monitor CPU/memory on Ollama endpoints
+   - Adjust `max_concurrent_connections` accordingly
+
+## Best Practices
+
+### Monitoring Setup
+
+1. **Set up health checks**:
+   - Monitor `/health` endpoint
+   - Alert on status "error"
+
+2. **Track usage metrics**:
+   - Monitor connection counts
+   - Track token usage
+   - Watch for connection limits
+
+3. **Log important events**:
+   - Configuration changes
+   - Endpoint failures
+   - Recovery events
+
+4. **Regular backups**:
+   - Backup token_counts.db
+   - Schedule regular backups
+   - Test restore procedure
+
+### Performance Monitoring
+
+1. **Baseline metrics**:
+   - Establish normal usage patterns
+   - Track trends over time
+
+2. **Alert thresholds**:
+   - Set alerts for high connection counts
+   - Monitor error rates
+   - Watch for latency spikes
+
+3. **Capacity planning**:
+   - Track growth in token usage
+   - Plan for scaling needs
+   - Monitor resource utilization
+
+### Incident Response
+
+1. **Quick diagnosis**:
+   - Check health endpoint first
+   - Review recent logs
+   - Verify endpoint status
+
+2. **Isolation**:
+   - Identify affected endpoints
+   - Isolate problematic components
+   - Fallback to healthy endpoints
+
+3. **Recovery**:
+   - Restart router if needed
+   - Rebalance load
+   - Restore from backup if necessary
+
+## Examples
+
+See the [examples](examples/) directory for monitoring configuration examples.