nomyo-router/doc/monitoring.md

# Monitoring and Troubleshooting Guide

## Monitoring Overview

NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.

## Monitoring Endpoints

### Health Check

```bash
curl http://localhost:12434/health
```

Response:
```json
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```

**HTTP Status Codes**:
- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy

### Current Usage

```bash
curl http://localhost:12434/api/usage
```

Response:
```json
{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}
```

### Token Statistics

```bash
curl http://localhost:12434/api/token_counts
```

Response:
```json
{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}
```

### Model Statistics

```bash
curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
```

Response:
```json
{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}
```

### Configuration Status

```bash
curl http://localhost:12434/api/config
```

Response:
```json
{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}
```

### Real-time Usage Stream

```bash
curl http://localhost:12434/api/usage-stream
```

This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}

data: {"usage_counts": {...}, "token_usage_counts": {...}}
```

## Monitoring Tools

### Prometheus Integration

Create a Prometheus scrape configuration:

```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```

### Grafana Dashboard

Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates

### Logging

The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. Endpoint Unavailable

**Symptoms**:
- Health check shows endpoint as "error"
- Requests fail with connection errors

**Diagnosis**:
```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```

**Solutions**:
- Verify Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test direct connection: `curl http://endpoint:11434/api/version`

#### 2. Model Not Found

**Symptoms**:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found

**Diagnosis**:
```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```

**Solutions**:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if model is available on any endpoint
- For OpenAI endpoints, ensure model exists in their catalog

#### 3. High Latency

**Symptoms**:
- Slow response times
- Requests timing out

**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```

**Solutions**:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints

#### 4. Connection Limits Reached

**Symptoms**:
- Requests queuing
- High connection counts
- Slow response times

**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
```

**Solutions**:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use MOE system for critical queries

#### 5. Token Tracking Not Working

**Symptoms**:
- Token counts not updating
- Database errors

**Diagnosis**:
```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```

**Solutions**:
- Verify database file permissions
- Check if database path is writable
- Restart router to rebuild database
- Check disk space
- Verify environment variable `NOMYO_ROUTER_DB_PATH`

#### 6. Streaming Issues

**Symptoms**:
- Incomplete responses
- Connection resets during streaming
- Timeout errors

**Diagnosis**:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts

**Solutions**:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads

### Error Messages

#### "Failed to connect to endpoint"

**Cause**: Network connectivity issue
**Action**: Verify endpoint is reachable, check firewall, test DNS

#### "None of the configured endpoints advertise the model"

**Cause**: Model not pulled on any endpoint
**Action**: Pull the model, verify model name

#### "Timed out waiting for endpoint"

**Cause**: Endpoint slow to respond
**Action**: Check endpoint health, increase timeouts

#### "Invalid JSON format in request body"

**Cause**: Malformed request
**Action**: Validate request payload, check API documentation

#### "Missing required field 'model'"

**Cause**: Request missing model parameter
**Action**: Add model parameter to request

### Performance Tuning

#### Optimizing Connection Handling

1. **Adjust concurrency limits**:
   ```yaml
   max_concurrent_connections: 4
   ```

2. **Monitor connection usage**:
   ```bash
   curl http://localhost:12434/api/usage
   ```

3. **Scale horizontally**:
   - Add more Ollama endpoints
   - Deploy multiple router instances

#### Reducing Latency

1. **Keep models loaded**:
   - Use frequently accessed models
   - Monitor `/api/ps` for loaded models

2. **Optimize endpoint selection**:
   - Distribute models across endpoints
   - Balance load evenly

3. **Use caching**:
   - Models cache (300s TTL)
   - Loaded models cache (30s TTL)

#### Memory Management

1. **Monitor memory usage**:
   - Token buffer grows with usage
   - Time-series data accumulates

2. **Adjust flush interval**:
   - Default: 10 seconds
   - Can be increased for less frequent I/O

3. **Database maintenance**:
   - Regular backups
   - Archive old data periodically

### Database Management

#### Viewing Token Data

```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```

#### Aggregating Old Data

```bash
curl http://localhost:12434/api/aggregate_time_series_days \
  -X POST \
  -d '{"days": 30, "trim_old": true}'
```

#### Backing Up Database

```bash
cp token_counts.db token_counts.db.backup
```

#### Restoring Database

```bash
cp token_counts.db.backup token_counts.db
```

### Advanced Troubleshooting

#### Debugging Endpoint Selection

Enable debug logging to see endpoint selection decisions:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```

#### Testing Individual Endpoints

```bash
# Test endpoint directly
curl http://endpoint:11434/api/version

# Test model availability
curl http://endpoint:11434/api/tags

# Test model loading
curl http://endpoint:11434/api/ps
```

#### Network Diagnostics

```bash
# Test connectivity
nc -zv endpoint 11434

# Test DNS resolution
dig endpoint

# Test latency
ping endpoint
```

### Common Pitfalls

1. **Using localhost in Docker**:
   - Inside Docker, `localhost` refers to the container itself
   - Use `host.docker.internal` or Docker service names

2. **Incorrect model names**:
   - Ollama: `llama3:latest`
   - OpenAI: `gpt-4` (no version suffix)

3. **Missing API keys**:
   - Remote endpoints require authentication
   - Set keys in config.yaml or environment variables

4. **Firewall blocking**:
   - Ensure port 11434 is open for Ollama
   - Ensure port 12434 is open for router

5. **Insufficient resources**:
   - Monitor CPU/memory on Ollama endpoints
   - Adjust `max_concurrent_connections` accordingly

## Best Practices

### Monitoring Setup

1. **Set up health checks**:
   - Monitor `/health` endpoint
   - Alert on status "error"

2. **Track usage metrics**:
   - Monitor connection counts
   - Track token usage
   - Watch for connection limits

3. **Log important events**:
   - Configuration changes
   - Endpoint failures
   - Recovery events

4. **Regular backups**:
   - Backup token_counts.db
   - Schedule regular backups
   - Test restore procedure

### Performance Monitoring

1. **Baseline metrics**:
   - Establish normal usage patterns
   - Track trends over time

2. **Alert thresholds**:
   - Set alerts for high connection counts
   - Monitor error rates
   - Watch for latency spikes

3. **Capacity planning**:
   - Track growth in token usage
   - Plan for scaling needs
   - Monitor resource utilization

### Incident Response

1. **Quick diagnosis**:
   - Check health endpoint first
   - Review recent logs
   - Verify endpoint status

2. **Isolation**:
   - Identify affected endpoints
   - Isolate problematic components
   - Fallback to healthy endpoints

3. **Recovery**:
   - Restart router if needed
   - Rebalance load
   - Restore from backup if necessary

## Examples

See the [examples](examples/) directory for monitoring configuration examples.
feat: added buffer_lock to prevent race condition in high concurrency scenarios added documentation 2026-01-05 17:16:31 +01:00			`# Monitoring and Troubleshooting Guide`

			`## Monitoring Overview`

			`NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.`

			`## Monitoring Endpoints`

			`### Health Check`

			```bash
			`curl http://localhost:12434/health`
			```

			`Response:`
			```json
			`{`
			`"status": "ok" \| "error",`
			`"endpoints": {`
			`"http://endpoint1:11434": {`
			`"status": "ok" \| "error",`
			`"version": "string" \| "detail": "error message"`
			`}`
			`}`
			`}`
			```

			`HTTP Status Codes:`
			- `200`: All endpoints healthy
			- `503`: One or more endpoints unhealthy

			`### Current Usage`

			```bash
			`curl http://localhost:12434/api/usage`
			```

			`Response:`
			```json
			`{`
			`"usage_counts": {`
			`"http://endpoint1:11434": {`
			`"llama3": 2,`
			`"mistral": 1`
			`},`
			`"http://endpoint2:11434": {`
			`"llama3": 0,`
			`"mistral": 3`
			`}`
			`},`
			`"token_usage_counts": {`
			`"http://endpoint1:11434": {`
			`"llama3": 1542,`
			`"mistral": 876`
			`}`
			`}`
			`}`
			```

			`### Token Statistics`

			```bash
			`curl http://localhost:12434/api/token_counts`
			```

			`Response:`
			```json
			`{`
			`"total_tokens": 2418,`
			`"breakdown": [`
			`{`
			`"endpoint": "http://endpoint1:11434",`
			`"model": "llama3",`
			`"input_tokens": 120,`
			`"output_tokens": 1422,`
			`"total_tokens": 1542`
			`},`
			`{`
			`"endpoint": "http://endpoint1:11434",`
			`"model": "mistral",`
			`"input_tokens": 80,`
			`"output_tokens": 796,`
			`"total_tokens": 876`
			`}`
			`]`
			`}`
			```

			`### Model Statistics`

			```bash
			`curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'`
			```

			`Response:`
			```json
			`{`
			`"model": "llama3",`
			`"input_tokens": 120,`
			`"output_tokens": 1422,`
			`"total_tokens": 1542,`
			`"time_series": [`
			`{`
			`"endpoint": "http://endpoint1:11434",`
			`"timestamp": 1712345600,`
			`"input_tokens": 20,`
			`"output_tokens": 150,`
			`"total_tokens": 170`
			`}`
			`],`
			`"endpoint_distribution": {`
			`"http://endpoint1:11434": 1542`
			`}`
			`}`
			```

			`### Configuration Status`

			```bash
			`curl http://localhost:12434/api/config`
			```

			`Response:`
			```json
			`{`
			`"endpoints": [`
			`{`
			`"url": "http://endpoint1:11434",`
			`"status": "ok" \| "error",`
			`"version": "string" \| "detail": "error message"`
			`}`
			`]`
			`}`
			```

			`### Real-time Usage Stream`

			```bash
			`curl http://localhost:12434/api/usage-stream`
			```

			`This provides Server-Sent Events (SSE) with real-time updates:`
			```
			`data: {"usage_counts": {...}, "token_usage_counts": {...}}`

			`data: {"usage_counts": {...}, "token_usage_counts": {...}}`
			```

			`## Monitoring Tools`

			`### Prometheus Integration`

			`Create a Prometheus scrape configuration:`

			```yaml
			`scrape_configs:`
			`- job_name: 'nomyo-router'`
			`metrics_path: '/api/usage'`
			`params:`
			`format: ['prometheus']`
			`static_configs:`
			`- targets: ['nomyo-router:12434']`
			```

			`### Grafana Dashboard`

			`Create a dashboard with these panels:`
			`- Endpoint health status`
			`- Current connection counts`
			`- Token usage (input/output/total)`
			`- Request rates`
			`- Response times`
			`- Error rates`

			`### Logging`

			`The router logs important events to stdout:`
			`- Configuration loading`
			`- Endpoint connection issues`
			`- Token counting operations`
			`- Error conditions`

			`## Troubleshooting Guide`

			`### Common Issues and Solutions`

			`#### 1. Endpoint Unavailable`

			`Symptoms:`
			`- Health check shows endpoint as "error"`
			`- Requests fail with connection errors`

			`Diagnosis:`
			```bash
			`curl http://localhost:12434/health`
			`curl http://localhost:12434/api/config`
			```

			`Solutions:`
			`- Verify Ollama endpoint is running`
			`- Check network connectivity`
			`- Verify firewall rules`
			`- Check DNS resolution`
			- Test direct connection: `curl http://endpoint:11434/api/version`

			`#### 2. Model Not Found`

			`Symptoms:`
			`- Error: "None of the configured endpoints advertise the model"`
			`- Requests fail with model not found`

			`Diagnosis:`
			```bash
			`curl http://localhost:12434/api/tags`
			`curl http://endpoint:11434/api/tags`
			```

			`Solutions:`
			- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
			`- Verify model name spelling`
			`- Check if model is available on any endpoint`
			`- For OpenAI endpoints, ensure model exists in their catalog`

			`#### 3. High Latency`

			`Symptoms:`
			`- Slow response times`
			`- Requests timing out`

			`Diagnosis:`
			```bash
			`curl http://localhost:12434/api/usage`
			`curl http://localhost:12434/api/config`
			```

			`Solutions:`
			`- Check if endpoints are overloaded (high connection counts)`
			- Increase `max_concurrent_connections`
			`- Add more endpoints to the cluster`
			`- Monitor Ollama endpoint performance`
			`- Check network latency between router and endpoints`

			`#### 4. Connection Limits Reached`

			`Symptoms:`
			`- Requests queuing`
			`- High connection counts`
			`- Slow response times`

			`Diagnosis:`
			```bash
			`curl http://localhost:12434/api/usage`
			```

			`Solutions:`
			- Increase `max_concurrent_connections` in config.yaml
			`- Add more Ollama endpoints`
			`- Scale your Ollama cluster`
			`- Use MOE system for critical queries`

			`#### 5. Token Tracking Not Working`

			`Symptoms:`
			`- Token counts not updating`
			`- Database errors`

			`Diagnosis:`
			```bash
			`ls -la token_counts.db`
			`curl http://localhost:12434/api/token_counts`
			```

			`Solutions:`
			`- Verify database file permissions`
			`- Check if database path is writable`
			`- Restart router to rebuild database`
			`- Check disk space`
			- Verify environment variable `NOMYO_ROUTER_DB_PATH`

			`#### 6. Streaming Issues`

			`Symptoms:`
			`- Incomplete responses`
			`- Connection resets during streaming`
			`- Timeout errors`

			`Diagnosis:`
			`- Check router logs for errors`
			`- Test with non-streaming requests`
			`- Monitor connection counts`

			`Solutions:`
			`- Increase timeout settings`
			- Reduce `max_concurrent_connections`
			`- Check network stability`
			`- Test with smaller payloads`

			`### Error Messages`

			`#### "Failed to connect to endpoint"`

			`Cause: Network connectivity issue`
			`Action: Verify endpoint is reachable, check firewall, test DNS`

			`#### "None of the configured endpoints advertise the model"`

			`Cause: Model not pulled on any endpoint`
			`Action: Pull the model, verify model name`

			`#### "Timed out waiting for endpoint"`

			`Cause: Endpoint slow to respond`
			`Action: Check endpoint health, increase timeouts`

			`#### "Invalid JSON format in request body"`

			`Cause: Malformed request`
			`Action: Validate request payload, check API documentation`

			`#### "Missing required field 'model'"`

			`Cause: Request missing model parameter`
			`Action: Add model parameter to request`

			`### Performance Tuning`

			`#### Optimizing Connection Handling`

			`1. Adjust concurrency limits:`
			```yaml
			`max_concurrent_connections: 4`
			```

			`2. Monitor connection usage:`
			```bash
			`curl http://localhost:12434/api/usage`
			```

			`3. Scale horizontally:`
			`- Add more Ollama endpoints`
			`- Deploy multiple router instances`

			`#### Reducing Latency`

			`1. Keep models loaded:`
			`- Use frequently accessed models`
			- Monitor `/api/ps` for loaded models

			`2. Optimize endpoint selection:`
			`- Distribute models across endpoints`
			`- Balance load evenly`

			`3. Use caching:`
			`- Models cache (300s TTL)`
			`- Loaded models cache (30s TTL)`

			`#### Memory Management`

			`1. Monitor memory usage:`
			`- Token buffer grows with usage`
			`- Time-series data accumulates`

			`2. Adjust flush interval:`
			`- Default: 10 seconds`
			`- Can be increased for less frequent I/O`

			`3. Database maintenance:`
			`- Regular backups`
			`- Archive old data periodically`

			`### Database Management`

			`#### Viewing Token Data`

			```bash
			`sqlite3 token_counts.db "SELECT * FROM token_counts;"`
			`sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"`
			```

			`#### Aggregating Old Data`

			```bash
			`curl http://localhost:12434/api/aggregate_time_series_days \`
			`-X POST \`
			`-d '{"days": 30, "trim_old": true}'`
			```

			`#### Backing Up Database`

			```bash
			`cp token_counts.db token_counts.db.backup`
			```

			`#### Restoring Database`

			```bash
			`cp token_counts.db.backup token_counts.db`
			```

			`### Advanced Troubleshooting`

			`#### Debugging Endpoint Selection`

			`Enable debug logging to see endpoint selection decisions:`
			```bash
			`uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug`
			```

			`#### Testing Individual Endpoints`

			```bash
			`# Test endpoint directly`
			`curl http://endpoint:11434/api/version`

			`# Test model availability`
			`curl http://endpoint:11434/api/tags`

			`# Test model loading`
			`curl http://endpoint:11434/api/ps`
			```

			`#### Network Diagnostics`

			```bash
			`# Test connectivity`
			`nc -zv endpoint 11434`

			`# Test DNS resolution`
			`dig endpoint`

			`# Test latency`
			`ping endpoint`
			```

			`### Common Pitfalls`

			`1. Using localhost in Docker:`
			- Inside Docker, `localhost` refers to the container itself
			- Use `host.docker.internal` or Docker service names

			`2. Incorrect model names:`
			- Ollama: `llama3:latest`
			- OpenAI: `gpt-4` (no version suffix)

			`3. Missing API keys:`
			`- Remote endpoints require authentication`
			`- Set keys in config.yaml or environment variables`

			`4. Firewall blocking:`
			`- Ensure port 11434 is open for Ollama`
			`- Ensure port 12434 is open for router`

			`5. Insufficient resources:`
			`- Monitor CPU/memory on Ollama endpoints`
			- Adjust `max_concurrent_connections` accordingly

			`## Best Practices`

			`### Monitoring Setup`

			`1. Set up health checks:`
			- Monitor `/health` endpoint
			`- Alert on status "error"`

			`2. Track usage metrics:`
			`- Monitor connection counts`
			`- Track token usage`
			`- Watch for connection limits`

			`3. Log important events:`
			`- Configuration changes`
			`- Endpoint failures`
			`- Recovery events`

			`4. Regular backups:`
			`- Backup token_counts.db`
			`- Schedule regular backups`
			`- Test restore procedure`

			`### Performance Monitoring`

			`1. Baseline metrics:`
			`- Establish normal usage patterns`
			`- Track trends over time`

			`2. Alert thresholds:`
			`- Set alerts for high connection counts`
			`- Monitor error rates`
			`- Watch for latency spikes`

			`3. Capacity planning:`
			`- Track growth in token usage`
			`- Plan for scaling needs`
			`- Monitor resource utilization`

			`### Incident Response`

			`1. Quick diagnosis:`
			`- Check health endpoint first`
			`- Review recent logs`
			`- Verify endpoint status`

			`2. Isolation:`
			`- Identify affected endpoints`
			`- Isolate problematic components`
			`- Fallback to healthy endpoints`

			`3. Recovery:`
			`- Restart router if needed`
			`- Rebalance load`
			`- Restore from backup if necessary`

			`## Examples`

			`See the [examples](examples/) directory for monitoring configuration examples.`