# Monitoring and Troubleshooting Guide
## Monitoring Overview
NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.
## Monitoring Endpoints
### Health Check
```bash
curl http://localhost:12434/health
```
Response:
```json
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```
**HTTP Status Codes**:
- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy
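An alerting probe can key off the 503 and name the failing endpoints directly from the body. A minimal sketch in Python (the sample payload is hypothetical, in the shape documented above):

```python
import json

def unhealthy_endpoints(status_code: int, body: str) -> list:
    """Return the endpoints reporting "error", given a /health response.

    A 200 means every endpoint is healthy; on a 503 the body says which
    endpoints failed, so an alert can name them directly.
    """
    if status_code == 200:
        return []
    health = json.loads(body)
    return [url for url, info in health.get("endpoints", {}).items()
            if info.get("status") == "error"]

# Hypothetical 503 body in the documented shape:
sample = json.dumps({
    "status": "error",
    "endpoints": {
        "http://endpoint1:11434": {"status": "ok", "version": "x.y.z"},
        "http://endpoint2:11434": {"status": "error",
                                   "detail": "connection refused"},
    },
})
print(unhealthy_endpoints(503, sample))  # ['http://endpoint2:11434']
```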
### Current Usage
```bash
curl http://localhost:12434/api/usage
```
Response:
```json
{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}
```
### Token Statistics
```bash
curl http://localhost:12434/api/token_counts
```
Response:
```json
{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}
```
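The totals in this payload are internally consistent: each entry's `total_tokens` is `input_tokens + output_tokens`, and the top-level `total_tokens` is the sum over `breakdown`. A quick consistency check on the example above:

```python
# Sanity-check a /api/token_counts payload: each entry's total should equal
# input + output, and the grand total should equal the sum of entry totals.
payload = {
    "total_tokens": 2418,
    "breakdown": [
        {"endpoint": "http://endpoint1:11434", "model": "llama3",
         "input_tokens": 120, "output_tokens": 1422, "total_tokens": 1542},
        {"endpoint": "http://endpoint1:11434", "model": "mistral",
         "input_tokens": 80, "output_tokens": 796, "total_tokens": 876},
    ],
}

for entry in payload["breakdown"]:
    assert entry["total_tokens"] == entry["input_tokens"] + entry["output_tokens"]
assert payload["total_tokens"] == sum(e["total_tokens"] for e in payload["breakdown"])
```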
### Model Statistics
```bash
curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
```
Response:
```json
{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}
}
```
### Configuration Status
```bash
curl http://localhost:12434/api/config
```
Response:
```json
{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}
```
### Cache Statistics
```bash
curl http://localhost:12434/api/cache/stats
```
Response when cache is enabled:
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
Response when cache is disabled:
```json
{ "enabled": false }
```
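`hit_rate` is derived from the two counters as `hits / (hits + misses)`; with the values above, 1547 / (1547 + 892) ≈ 0.634. As a small sketch:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as reported by /api/cache/stats (0.0 on a cold cache)."""
    total = hits + misses
    return round(hits / total, 3) if total else 0.0

print(hit_rate(1547, 892))  # 0.634
```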
### Cache Invalidation
```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
Clears all cached entries and resets hit/miss counters.
### Real-time Usage Stream
```bash
curl http://localhost:12434/api/usage-stream
```
This provides Server-Sent Events (SSE) with real-time updates:
```
data: {"usage_counts": {...}, "token_usage_counts": {...}}
data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
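A consumer only needs to pick out the `data:` lines from the stream; a minimal parser sketch (the HTTP transport itself is omitted, and the sample lines are hypothetical):

```python
import json

def parse_sse_events(stream_lines):
    """Yield decoded JSON payloads from Server-Sent Events `data:` lines,
    skipping blank keep-alive lines."""
    for line in stream_lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Hypothetical slice of the stream:
sample = [
    'data: {"usage_counts": {}, "token_usage_counts": {}}',
    '',
    'data: {"usage_counts": {"http://endpoint1:11434": {"llama3": 1}}, "token_usage_counts": {}}',
]
events = list(parse_sse_events(sample))
print(len(events))  # 2
```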
## Monitoring Tools
### Prometheus Integration
Create a Prometheus scrape configuration:
```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```
### Grafana Dashboard
Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates
### Logging
The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions
## Troubleshooting Guide
### Common Issues and Solutions
#### 1. Endpoint Unavailable
**Symptoms**:
- Health check shows endpoint as "error"
- Requests fail with connection errors
**Diagnosis**:
```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```
**Solutions**:
- Verify Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test direct connection: `curl http://endpoint:11434/api/version`
#### 2. Model Not Found
**Symptoms**:
- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found
**Diagnosis**:
```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```
**Solutions**:
- Pull the model on the endpoint: `curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'`
- Verify model name spelling
- Check if model is available on any endpoint
- For OpenAI endpoints, ensure model exists in their catalog
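To check which endpoints actually advertise a model, compare each endpoint's `/api/tags` payload; a sketch, assuming the payload lists models under a `models` key as in Ollama's API:

```python
def endpoints_with_model(tags_by_endpoint, model):
    """Given each endpoint's /api/tags payload, return the endpoints that
    advertise `model` (Ollama names may carry a ":latest" tag suffix)."""
    wanted = model.split(":")[0]
    return [ep for ep, tags in tags_by_endpoint.items()
            if any(m["name"].split(":")[0] == wanted
                   for m in tags.get("models", []))]

# Hypothetical /api/tags payloads collected from two endpoints:
tags = {
    "http://endpoint1:11434": {"models": [{"name": "llama3:latest"}]},
    "http://endpoint2:11434": {"models": [{"name": "mistral:latest"}]},
}
print(endpoints_with_model(tags, "llama3"))  # ['http://endpoint1:11434']
```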
#### 3. High Latency
**Symptoms**:
- Slow response times
- Requests timing out
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```
**Solutions**:
- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between router and endpoints
#### 4. Connection Limits Reached
**Symptoms**:
- Requests queuing
- High connection counts
- Slow response times
**Diagnosis**:
```bash
curl http://localhost:12434/api/usage
```
**Solutions**:
- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use MOE system for critical queries
#### 5. Token Tracking Not Working
**Symptoms**:
- Token counts not updating
- Database errors
**Diagnosis**:
```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```
**Solutions**:
- Verify database file permissions
- Check if database path is writable
- Restart router to rebuild database
- Check disk space
- Verify environment variable `NOMYO_ROUTER_DB_PATH`
#### 6. Streaming Issues
**Symptoms**:
- Incomplete responses
- Connection resets during streaming
- Timeout errors
**Diagnosis**:
- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts
**Solutions**:
- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads
### Error Messages
#### "Failed to connect to endpoint"
- **Cause**: Network connectivity issue
- **Action**: Verify the endpoint is reachable; check firewall rules and DNS resolution
#### "None of the configured endpoints advertise the model"
- **Cause**: The model is not pulled on any endpoint
- **Action**: Pull the model and verify the model name
#### "Timed out waiting for endpoint"
- **Cause**: The endpoint is slow to respond
- **Action**: Check endpoint health and increase timeouts
#### "Invalid JSON format in request body"
- **Cause**: Malformed request body
- **Action**: Validate the request payload against the API documentation
#### "Missing required field 'model'"
- **Cause**: The request is missing the `model` parameter
- **Action**: Add a `model` parameter to the request
### Performance Tuning
#### Optimizing Connection Handling
1. **Adjust concurrency limits**:
```yaml
max_concurrent_connections: 4
```
2. **Monitor connection usage**:
```bash
curl http://localhost:12434/api/usage
```
3. **Scale horizontally**:
- Add more Ollama endpoints
- Deploy multiple router instances
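The `/api/usage` payload from step 2 can be reduced to a per-endpoint total and compared against `max_concurrent_connections`. A small sketch using the example values from the Current Usage section:

```python
def connections_per_endpoint(usage):
    """Total in-flight connections per endpoint from a /api/usage payload."""
    return {ep: sum(models.values())
            for ep, models in usage["usage_counts"].items()}

usage = {
    "usage_counts": {
        "http://endpoint1:11434": {"llama3": 2, "mistral": 1},
        "http://endpoint2:11434": {"llama3": 0, "mistral": 3},
    }
}
totals = connections_per_endpoint(usage)
print(totals)  # {'http://endpoint1:11434': 3, 'http://endpoint2:11434': 3}
```

Endpoints whose total sits persistently at the configured limit are candidates for a higher limit or for offloading to additional endpoints.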
#### Reducing Latency
1. **Keep models loaded**:
- Frequent requests keep a model resident in memory, avoiding cold-start load times
- Monitor `/api/ps` to see which models are currently loaded
2. **Optimize endpoint selection**:
- Distribute models across endpoints
- Balance load evenly
3. **Use caching**:
- Models cache (300s TTL)
- Loaded models cache (30s TTL)
#### Memory Management
1. **Monitor memory usage**:
- Token buffer grows with usage
- Time-series data accumulates
2. **Adjust flush interval**:
- Default: 10 seconds
- Can be increased for less frequent I/O
3. **Database maintenance**:
- Regular backups
- Archive old data periodically
### Database Management
#### Viewing Token Data
```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```
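The same inspection works from Python's standard library. The schema below is a hypothetical stand-in mirroring the `/api/token_counts` fields; check the actual layout first with `sqlite3 token_counts.db ".schema"`:

```python
import os
import sqlite3
import tempfile

# Hypothetical schema mirroring the /api/token_counts fields -- verify the
# real layout with `.schema` before running queries against a live database.
db_path = os.path.join(tempfile.mkdtemp(), "token_counts.db")
con = sqlite3.connect(db_path)
con.execute("""CREATE TABLE IF NOT EXISTS token_counts (
    endpoint TEXT, model TEXT,
    input_tokens INTEGER, output_tokens INTEGER)""")
con.execute("INSERT INTO token_counts VALUES (?, ?, ?, ?)",
            ("http://endpoint1:11434", "llama3", 120, 1422))
con.commit()

rows = con.execute(
    "SELECT model, input_tokens + output_tokens FROM token_counts").fetchall()
print(rows)  # [('llama3', 1542)]
con.close()
```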
#### Aggregating Old Data
```bash
curl http://localhost:12434/api/aggregate_time_series_days \
-X POST \
-d '{"days": 30, "trim_old": true}'
```
#### Backing Up Database
```bash
cp token_counts.db token_counts.db.backup
```
#### Restoring Database
```bash
cp token_counts.db.backup token_counts.db
```
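If the router is running while you back up, a plain `cp` can capture a mid-write snapshot. SQLite's online-backup API (exposed through Python's `sqlite3` module) copies a consistent snapshot instead; demonstrated here on a throwaway database:

```python
import os
import sqlite3
import tempfile

# Copying a SQLite file with `cp` while the router is writing can produce a
# torn snapshot; the online-backup API copies a consistent one instead.
tmp = tempfile.mkdtemp()
src = sqlite3.connect(os.path.join(tmp, "token_counts.db"))
src.execute("CREATE TABLE token_counts (model TEXT, total INTEGER)")
src.execute("INSERT INTO token_counts VALUES ('llama3', 1542)")
src.commit()

dst = sqlite3.connect(os.path.join(tmp, "token_counts.db.backup"))
src.backup(dst)  # consistent copy even under concurrent writes
copied = dst.execute("SELECT total FROM token_counts").fetchone()[0]
print(copied)  # 1542
src.close()
dst.close()
```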
### Advanced Troubleshooting
#### Debugging Endpoint Selection
Enable debug logging to see endpoint selection decisions:
```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```
#### Testing Individual Endpoints
```bash
# Test endpoint directly
curl http://endpoint:11434/api/version
# Test model availability
curl http://endpoint:11434/api/tags
# Test model loading
curl http://endpoint:11434/api/ps
```
#### Network Diagnostics
```bash
# Test connectivity
nc -zv endpoint 11434
# Test DNS resolution
dig endpoint
# Test latency
ping endpoint
```
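When `nc` or `dig` is unavailable (minimal containers, Windows hosts), the same connectivity check can be done with Python's standard library:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds; a portable
    stand-in for `nc -zv` when netcat is not installed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("endpoint", 11434) for Ollama,
#      port_open("localhost", 12434) for the router
print(port_open("127.0.0.1", 9, timeout=0.5))
```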
### Common Pitfalls
1. **Using localhost in Docker**:
- Inside Docker, `localhost` refers to the container itself
- Use `host.docker.internal` or Docker service names
2. **Incorrect model names**:
- Ollama: `llama3:latest`
- OpenAI: `gpt-4` (no version suffix)
3. **Missing API keys**:
- Remote endpoints require authentication
- Set keys in config.yaml or environment variables
4. **Firewall blocking**:
- Ensure port 11434 is open for Ollama
- Ensure port 12434 is open for router
5. **Insufficient resources**:
- Monitor CPU/memory on Ollama endpoints
- Adjust `max_concurrent_connections` accordingly
## Best Practices
### Monitoring Setup
1. **Set up health checks**:
- Monitor `/health` endpoint
- Alert on status "error"
2. **Track usage metrics**:
- Monitor connection counts
- Track token usage
- Watch for connection limits
3. **Log important events**:
- Configuration changes
- Endpoint failures
- Recovery events
4. **Regular backups**:
- Backup token_counts.db
- Schedule regular backups
- Test restore procedure
### Performance Monitoring
1. **Baseline metrics**:
- Establish normal usage patterns
- Track trends over time
2. **Alert thresholds**:
- Set alerts for high connection counts
- Monitor error rates
- Watch for latency spikes
3. **Capacity planning**:
- Track growth in token usage
- Plan for scaling needs
- Monitor resource utilization
### Incident Response
1. **Quick diagnosis**:
- Check health endpoint first
- Review recent logs
- Verify endpoint status
2. **Isolation**:
- Identify affected endpoints
- Isolate problematic components
- Fallback to healthy endpoints
3. **Recovery**:
- Restart router if needed
- Rebalance load
- Restore from backup if necessary
## Examples
See the [examples](examples/) directory for monitoring configuration examples.