nomyo-router/doc/monitoring.md

10 KiB

Monitoring and Troubleshooting Guide

Monitoring Overview

NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.

Monitoring Endpoints

Health Check

curl http://localhost:12434/health

Response:

{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}

HTTP Status Codes:

  • 200: All endpoints healthy
  • 503: One or more endpoints unhealthy

Current Usage

curl http://localhost:12434/api/usage

Response:

{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}

Token Statistics

curl http://localhost:12434/api/token_counts

Response:

{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}

Model Statistics

curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'

Response:

{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}

Configuration Status

curl http://localhost:12434/api/config

Response:

{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}

Cache Statistics

curl http://localhost:12434/api/cache/stats

Response when cache is enabled:

{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}

Response when cache is disabled:

{ "enabled": false }

Cache Invalidation

curl -X POST http://localhost:12434/api/cache/invalidate

Clears all cached entries and resets hit/miss counters.

Real-time Usage Stream

curl http://localhost:12434/api/usage-stream

This provides Server-Sent Events (SSE) with real-time updates:

data: {"usage_counts": {...}, "token_usage_counts": {...}}

data: {"usage_counts": {...}, "token_usage_counts": {...}}

Monitoring Tools

Prometheus Integration

Create a Prometheus scrape configuration:

scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']

Grafana Dashboard

Create a dashboard with these panels:

  • Endpoint health status
  • Current connection counts
  • Token usage (input/output/total)
  • Request rates
  • Response times
  • Error rates

Logging

The router logs important events to stdout:

  • Configuration loading
  • Endpoint connection issues
  • Token counting operations
  • Error conditions

Troubleshooting Guide

Common Issues and Solutions

1. Endpoint Unavailable

Symptoms:

  • Health check shows endpoint as "error"
  • Requests fail with connection errors

Diagnosis:

curl http://localhost:12434/health
curl http://localhost:12434/api/config

Solutions:

  • Verify Ollama endpoint is running
  • Check network connectivity
  • Verify firewall rules
  • Check DNS resolution
  • Test direct connection: curl http://endpoint:11434/api/version

2. Model Not Found

Symptoms:

  • Error: "None of the configured endpoints advertise the model"
  • Requests fail with model not found

Diagnosis:

curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags

Solutions:

  • Pull the model on the endpoint: curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'
  • Verify model name spelling
  • Check if model is available on any endpoint
  • For OpenAI endpoints, ensure model exists in their catalog

3. High Latency

Symptoms:

  • Slow response times
  • Requests timing out

Diagnosis:

curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config

Solutions:

  • Check if endpoints are overloaded (high connection counts)
  • Increase max_concurrent_connections
  • Add more endpoints to the cluster
  • Monitor Ollama endpoint performance
  • Check network latency between router and endpoints

4. Connection Limits Reached

Symptoms:

  • Requests queuing
  • High connection counts
  • Slow response times

Diagnosis:

curl http://localhost:12434/api/usage

Solutions:

  • Increase max_concurrent_connections in config.yaml
  • Add more Ollama endpoints
  • Scale your Ollama cluster
  • Use MOE system for critical queries

5. Token Tracking Not Working

Symptoms:

  • Token counts not updating
  • Database errors

Diagnosis:

ls -la token_counts.db
curl http://localhost:12434/api/token_counts

Solutions:

  • Verify database file permissions
  • Check if database path is writable
  • Restart router to rebuild database
  • Check disk space
  • Verify environment variable NOMYO_ROUTER_DB_PATH

6. Streaming Issues

Symptoms:

  • Incomplete responses
  • Connection resets during streaming
  • Timeout errors

Diagnosis:

  • Check router logs for errors
  • Test with non-streaming requests
  • Monitor connection counts

Solutions:

  • Increase timeout settings
  • Reduce max_concurrent_connections
  • Check network stability
  • Test with smaller payloads

Error Messages

"Failed to connect to endpoint"

Cause: Network connectivity issue Action: Verify endpoint is reachable, check firewall, test DNS

"None of the configured endpoints advertise the model"

Cause: Model not pulled on any endpoint Action: Pull the model, verify model name

"Timed out waiting for endpoint"

Cause: Endpoint slow to respond Action: Check endpoint health, increase timeouts

"Invalid JSON format in request body"

Cause: Malformed request Action: Validate request payload, check API documentation

"Missing required field 'model'"

Cause: Request missing model parameter Action: Add model parameter to request

Performance Tuning

Optimizing Connection Handling

  1. Adjust concurrency limits:

    max_concurrent_connections: 4
    
  2. Monitor connection usage:

    curl http://localhost:12434/api/usage
    
  3. Scale horizontally:

    • Add more Ollama endpoints
    • Deploy multiple router instances

Reducing Latency

  1. Keep models loaded:

    • Use frequently accessed models
    • Monitor /api/ps for loaded models
  2. Optimize endpoint selection:

    • Distribute models across endpoints
    • Balance load evenly
  3. Use caching:

    • Models cache (300s TTL)
    • Loaded models cache (30s TTL)

Memory Management

  1. Monitor memory usage:

    • Token buffer grows with usage
    • Time-series data accumulates
  2. Adjust flush interval:

    • Default: 10 seconds
    • Can be increased for less frequent I/O
  3. Database maintenance:

    • Regular backups
    • Archive old data periodically

Database Management

Viewing Token Data

sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"

Aggregating Old Data

curl http://localhost:12434/api/aggregate_time_series_days \
  -X POST \
  -d '{"days": 30, "trim_old": true}'

Backing Up Database

cp token_counts.db token_counts.db.backup

Restoring Database

cp token_counts.db.backup token_counts.db

Advanced Troubleshooting

Debugging Endpoint Selection

Enable debug logging to see endpoint selection decisions:

uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug

Testing Individual Endpoints

# Test endpoint directly
curl http://endpoint:11434/api/version

# Test model availability
curl http://endpoint:11434/api/tags

# Test model loading
curl http://endpoint:11434/api/ps

Network Diagnostics

# Test connectivity
nc -zv endpoint 11434

# Test DNS resolution
dig endpoint

# Test latency
ping endpoint

Common Pitfalls

  1. Using localhost in Docker:

    • Inside Docker, localhost refers to the container itself
    • Use host.docker.internal or Docker service names
  2. Incorrect model names:

    • Ollama: llama3:latest
    • OpenAI: gpt-4 (no version suffix)
  3. Missing API keys:

    • Remote endpoints require authentication
    • Set keys in config.yaml or environment variables
  4. Firewall blocking:

    • Ensure port 11434 is open for Ollama
    • Ensure port 12434 is open for router
  5. Insufficient resources:

    • Monitor CPU/memory on Ollama endpoints
    • Adjust max_concurrent_connections accordingly

Best Practices

Monitoring Setup

  1. Set up health checks:

    • Monitor /health endpoint
    • Alert on status "error"
  2. Track usage metrics:

    • Monitor connection counts
    • Track token usage
    • Watch for connection limits
  3. Log important events:

    • Configuration changes
    • Endpoint failures
    • Recovery events
  4. Regular backups:

    • Backup token_counts.db
    • Schedule regular backups
    • Test restore procedure

Performance Monitoring

  1. Baseline metrics:

    • Establish normal usage patterns
    • Track trends over time
  2. Alert thresholds:

    • Set alerts for high connection counts
    • Monitor error rates
    • Watch for latency spikes
  3. Capacity planning:

    • Track growth in token usage
    • Plan for scaling needs
    • Monitor resource utilization

Incident Response

  1. Quick diagnosis:

    • Check health endpoint first
    • Review recent logs
    • Verify endpoint status
  2. Isolation:

    • Identify affected endpoints
    • Isolate problematic components
    • Fallback to healthy endpoints
  3. Recovery:

    • Restart router if needed
    • Rebalance load
    • Restore from backup if necessary

Examples

See the examples directory for monitoring configuration examples.