# Monitoring and Troubleshooting Guide

## Monitoring Overview

NOMYO Router provides comprehensive monitoring capabilities to track performance, health, and usage patterns.

## Monitoring Endpoints
### Health Check

```bash
curl http://localhost:12434/health
```

Response:

```json
{
  "status": "ok" | "error",
  "endpoints": {
    "http://endpoint1:11434": {
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  }
}
```
HTTP Status Codes:

- `200`: All endpoints healthy
- `503`: One or more endpoints unhealthy
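If you want to act on this response from a monitoring script, a minimal Python sketch follows. The response shape is taken from the example above; the alerting action itself is left to you:

```python
def unhealthy_endpoints(health: dict) -> list[str]:
    """Return the URLs of endpoints whose status is not 'ok',
    given a parsed /health response."""
    return [
        url
        for url, info in health.get("endpoints", {}).items()
        if info.get("status") != "ok"
    ]

# Sample response in the shape shown above.
sample = {
    "status": "error",
    "endpoints": {
        "http://endpoint1:11434": {"status": "ok", "version": "0.1.32"},
        "http://endpoint2:11434": {"status": "error", "detail": "connection refused"},
    },
}
print(unhealthy_endpoints(sample))  # ['http://endpoint2:11434']
```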
### Current Usage

```bash
curl http://localhost:12434/api/usage
```

Response:

```json
{
  "usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 2,
      "mistral": 1
    },
    "http://endpoint2:11434": {
      "llama3": 0,
      "mistral": 3
    }
  },
  "token_usage_counts": {
    "http://endpoint1:11434": {
      "llama3": 1542,
      "mistral": 876
    }
  }
}
```
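When checking for overload, what usually matters is the total number of active connections per endpoint rather than the per-model counts. A small helper can summarize the response accordingly; a sketch based on the example above:

```python
def connections_per_endpoint(usage: dict) -> dict[str, int]:
    """Sum the per-model connection counts in usage_counts per endpoint."""
    return {
        endpoint: sum(model_counts.values())
        for endpoint, model_counts in usage.get("usage_counts", {}).items()
    }

# Sample /api/usage response in the shape shown above.
sample = {
    "usage_counts": {
        "http://endpoint1:11434": {"llama3": 2, "mistral": 1},
        "http://endpoint2:11434": {"llama3": 0, "mistral": 3},
    }
}
print(connections_per_endpoint(sample))
# {'http://endpoint1:11434': 3, 'http://endpoint2:11434': 3}
```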
### Token Statistics

```bash
curl http://localhost:12434/api/token_counts
```

Response:

```json
{
  "total_tokens": 2418,
  "breakdown": [
    {
      "endpoint": "http://endpoint1:11434",
      "model": "llama3",
      "input_tokens": 120,
      "output_tokens": 1422,
      "total_tokens": 1542
    },
    {
      "endpoint": "http://endpoint1:11434",
      "model": "mistral",
      "input_tokens": 80,
      "output_tokens": 796,
      "total_tokens": 876
    }
  ]
}
```
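To roll the breakdown up by endpoint (e.g. for a per-host dashboard panel), something like the following works on the response shape shown above:

```python
from collections import defaultdict

def tokens_by_endpoint(stats: dict) -> dict[str, int]:
    """Sum total_tokens over the breakdown rows, grouped by endpoint."""
    totals: dict[str, int] = defaultdict(int)
    for row in stats.get("breakdown", []):
        totals[row["endpoint"]] += row["total_tokens"]
    return dict(totals)

# Sample /api/token_counts response matching the example above.
sample = {
    "total_tokens": 2418,
    "breakdown": [
        {"endpoint": "http://endpoint1:11434", "model": "llama3", "total_tokens": 1542},
        {"endpoint": "http://endpoint1:11434", "model": "mistral", "total_tokens": 876},
    ],
}
print(tokens_by_endpoint(sample))  # {'http://endpoint1:11434': 2418}
```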
### Model Statistics

```bash
curl http://localhost:12434/api/stats -X POST -d '{"model": "llama3"}'
```

Response:

```json
{
  "model": "llama3",
  "input_tokens": 120,
  "output_tokens": 1422,
  "total_tokens": 1542,
  "time_series": [
    {
      "endpoint": "http://endpoint1:11434",
      "timestamp": 1712345600,
      "input_tokens": 20,
      "output_tokens": 150,
      "total_tokens": 170
    }
  ],
  "endpoint_distribution": {
    "http://endpoint1:11434": 1542
  }
}
```
### Configuration Status

```bash
curl http://localhost:12434/api/config
```

Response:

```json
{
  "endpoints": [
    {
      "url": "http://endpoint1:11434",
      "status": "ok" | "error",
      "version": "string" | "detail": "error message"
    }
  ]
}
```
### Real-time Usage Stream

```bash
curl http://localhost:12434/api/usage-stream
```

This endpoint provides Server-Sent Events (SSE) with real-time updates:

```
data: {"usage_counts": {...}, "token_usage_counts": {...}}

data: {"usage_counts": {...}, "token_usage_counts": {...}}
```
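Each event is a `data:` line carrying a JSON payload. A minimal consumer that extracts the payloads from raw SSE text might look like this (a sketch; a production client would read the stream incrementally rather than from a complete string):

```python
import json

def parse_sse_events(raw: str) -> list[dict]:
    """Decode the JSON payload of each 'data:' line in SSE text."""
    events = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

# Two sample events in the shape shown above.
stream = (
    'data: {"usage_counts": {}, "token_usage_counts": {}}\n\n'
    'data: {"usage_counts": {"http://endpoint1:11434": {"llama3": 1}},'
    ' "token_usage_counts": {}}\n\n'
)
print(len(parse_sse_events(stream)))  # 2
```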
## Monitoring Tools

### Prometheus Integration

Create a Prometheus scrape configuration:

```yaml
scrape_configs:
  - job_name: 'nomyo-router'
    metrics_path: '/api/usage'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['nomyo-router:12434']
```
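Whether `/api/usage` honors the `format=prometheus` parameter depends on your router version. If it returns only JSON, a small sidecar exporter can translate it into Prometheus exposition format; a sketch with an illustrative (not official) metric name:

```python
def to_prometheus(usage: dict) -> str:
    """Render usage_counts as Prometheus exposition-format gauge lines.
    The metric name 'nomyo_active_connections' is illustrative only."""
    lines = ["# TYPE nomyo_active_connections gauge"]
    for endpoint, models in usage.get("usage_counts", {}).items():
        for model, count in models.items():
            lines.append(
                f'nomyo_active_connections{{endpoint="{endpoint}",model="{model}"}} {count}'
            )
    return "\n".join(lines) + "\n"

sample = {"usage_counts": {"http://endpoint1:11434": {"llama3": 2}}}
print(to_prometheus(sample))
```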
### Grafana Dashboard

Create a dashboard with these panels:
- Endpoint health status
- Current connection counts
- Token usage (input/output/total)
- Request rates
- Response times
- Error rates
### Logging

The router logs important events to stdout:
- Configuration loading
- Endpoint connection issues
- Token counting operations
- Error conditions
## Troubleshooting Guide

### Common Issues and Solutions

#### 1. Endpoint Unavailable

**Symptoms:**

- Health check shows the endpoint as "error"
- Requests fail with connection errors

**Diagnosis:**

```bash
curl http://localhost:12434/health
curl http://localhost:12434/api/config
```

**Solutions:**

- Verify the Ollama endpoint is running
- Check network connectivity
- Verify firewall rules
- Check DNS resolution
- Test the direct connection:

  ```bash
  curl http://endpoint:11434/api/version
  ```
#### 2. Model Not Found

**Symptoms:**

- Error: "None of the configured endpoints advertise the model"
- Requests fail with model not found

**Diagnosis:**

```bash
curl http://localhost:12434/api/tags
curl http://endpoint:11434/api/tags
```

**Solutions:**

- Pull the model on the endpoint:

  ```bash
  curl http://endpoint:11434/api/pull -d '{"name": "llama3"}'
  ```

- Verify the model name spelling
- Check if the model is available on any endpoint
- For OpenAI endpoints, ensure the model exists in their catalog
#### 3. High Latency

**Symptoms:**

- Slow response times
- Requests timing out

**Diagnosis:**

```bash
curl http://localhost:12434/api/usage
curl http://localhost:12434/api/config
```

**Solutions:**

- Check if endpoints are overloaded (high connection counts)
- Increase `max_concurrent_connections`
- Add more endpoints to the cluster
- Monitor Ollama endpoint performance
- Check network latency between the router and endpoints
#### 4. Connection Limits Reached

**Symptoms:**

- Requests queuing
- High connection counts
- Slow response times

**Diagnosis:**

```bash
curl http://localhost:12434/api/usage
```

**Solutions:**

- Increase `max_concurrent_connections` in config.yaml
- Add more Ollama endpoints
- Scale your Ollama cluster
- Use the MOE system for critical queries
#### 5. Token Tracking Not Working

**Symptoms:**

- Token counts not updating
- Database errors

**Diagnosis:**

```bash
ls -la token_counts.db
curl http://localhost:12434/api/token_counts
```

**Solutions:**

- Verify database file permissions
- Check that the database path is writable
- Restart the router to rebuild the database
- Check disk space
- Verify the `NOMYO_ROUTER_DB_PATH` environment variable
#### 6. Streaming Issues

**Symptoms:**

- Incomplete responses
- Connection resets during streaming
- Timeout errors

**Diagnosis:**

- Check router logs for errors
- Test with non-streaming requests
- Monitor connection counts

**Solutions:**

- Increase timeout settings
- Reduce `max_concurrent_connections`
- Check network stability
- Test with smaller payloads
### Error Messages

**"Failed to connect to endpoint"**

- Cause: Network connectivity issue
- Action: Verify the endpoint is reachable, check the firewall, test DNS

**"None of the configured endpoints advertise the model"**

- Cause: Model not pulled on any endpoint
- Action: Pull the model, verify the model name

**"Timed out waiting for endpoint"**

- Cause: Endpoint slow to respond
- Action: Check endpoint health, increase timeouts

**"Invalid JSON format in request body"**

- Cause: Malformed request
- Action: Validate the request payload, check the API documentation

**"Missing required field 'model'"**

- Cause: Request missing the model parameter
- Action: Add the model parameter to the request
## Performance Tuning

### Optimizing Connection Handling

1. Adjust concurrency limits:

   ```yaml
   max_concurrent_connections: 4
   ```

2. Monitor connection usage:

   ```bash
   curl http://localhost:12434/api/usage
   ```

3. Scale horizontally:
   - Add more Ollama endpoints
   - Deploy multiple router instances
### Reducing Latency

1. Keep models loaded:
   - Use frequently accessed models
   - Monitor `/api/ps` for loaded models

2. Optimize endpoint selection:
   - Distribute models across endpoints
   - Balance load evenly

3. Use caching:
   - Models cache (300 s TTL)
   - Loaded-models cache (30 s TTL)
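Both caches work the same way conceptually: a value is served until its time-to-live elapses, then refetched. A minimal TTL-cache sketch (the router's actual implementation may differ):

```python
import time

class TTLCache:
    """Minimal time-to-live cache, mirroring the idea behind the
    router's models (300 s) and loaded-models (30 s) caches."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        """Return the cached value, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

cache = TTLCache(ttl_seconds=300)
cache.set("models", ["llama3", "mistral"])
print(cache.get("models"))  # ['llama3', 'mistral']
```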
### Memory Management

1. Monitor memory usage:
   - The token buffer grows with usage
   - Time-series data accumulates

2. Adjust the flush interval:
   - Default: 10 seconds
   - Can be increased for less frequent I/O

3. Database maintenance:
   - Take regular backups
   - Archive old data periodically
## Database Management

### Viewing Token Data

```bash
sqlite3 token_counts.db "SELECT * FROM token_counts;"
sqlite3 token_counts.db "SELECT * FROM time_series LIMIT 100;"
```
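If you prefer scripting over the sqlite3 CLI, the same queries can be run from Python's standard library. The column layout below is assumed from the queries and token fields above, not taken from the router's source; adjust the names to match your actual `token_counts.db` schema:

```python
import sqlite3

# In-memory database with an *assumed* token_counts layout, for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts ("
    "endpoint TEXT, model TEXT, input_tokens INTEGER, output_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO token_counts VALUES (?, ?, ?, ?)",
    [
        ("http://endpoint1:11434", "llama3", 120, 1422),
        ("http://endpoint1:11434", "mistral", 80, 796),
    ],
)

# Total tokens across all endpoints and models.
total = conn.execute(
    "SELECT SUM(input_tokens + output_tokens) FROM token_counts"
).fetchone()[0]
print(total)  # 2418
```

Against a real database, replace `":memory:"` with the path to `token_counts.db` and drop the CREATE/INSERT statements.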
### Aggregating Old Data

```bash
curl http://localhost:12434/api/aggregate_time_series_days \
  -X POST \
  -d '{"days": 30, "trim_old": true}'
```

### Backing Up the Database

```bash
cp token_counts.db token_counts.db.backup
```

### Restoring the Database

```bash
cp token_counts.db.backup token_counts.db
```
## Advanced Troubleshooting

### Debugging Endpoint Selection

Enable debug logging to see endpoint selection decisions:

```bash
uvicorn router:app --host 0.0.0.0 --port 12434 --log-level debug
```

### Testing Individual Endpoints

```bash
# Test the endpoint directly
curl http://endpoint:11434/api/version

# Test model availability
curl http://endpoint:11434/api/tags

# Test model loading
curl http://endpoint:11434/api/ps
```
### Network Diagnostics

```bash
# Test connectivity
nc -zv endpoint 11434

# Test DNS resolution
dig endpoint

# Test latency
ping endpoint
```
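The connectivity check can also be done portably from Python when `nc` is unavailable; a sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Rough equivalent of `nc -zv host port`: True if a TCP
    connection can be established within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hostname and port from the commands above):
# port_open("endpoint", 11434)
```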
## Common Pitfalls

1. **Using localhost in Docker:**
   - Inside Docker, `localhost` refers to the container itself
   - Use `host.docker.internal` or Docker service names

2. **Incorrect model names:**
   - Ollama: `llama3:latest`
   - OpenAI: `gpt-4` (no version suffix)

3. **Missing API keys:**
   - Remote endpoints require authentication
   - Set keys in config.yaml or environment variables

4. **Firewall blocking:**
   - Ensure port 11434 is open for Ollama
   - Ensure port 12434 is open for the router

5. **Insufficient resources:**
   - Monitor CPU/memory on Ollama endpoints
   - Adjust `max_concurrent_connections` accordingly
## Best Practices

### Monitoring Setup

1. **Set up health checks:**
   - Monitor the `/health` endpoint
   - Alert on status "error"

2. **Track usage metrics:**
   - Monitor connection counts
   - Track token usage
   - Watch for connection limits

3. **Log important events:**
   - Configuration changes
   - Endpoint failures
   - Recovery events

4. **Regular backups:**
   - Back up token_counts.db
   - Schedule regular backups
   - Test the restore procedure
### Performance Monitoring

1. **Baseline metrics:**
   - Establish normal usage patterns
   - Track trends over time

2. **Alert thresholds:**
   - Set alerts for high connection counts
   - Monitor error rates
   - Watch for latency spikes

3. **Capacity planning:**
   - Track growth in token usage
   - Plan for scaling needs
   - Monitor resource utilization
### Incident Response

1. **Quick diagnosis:**
   - Check the health endpoint first
   - Review recent logs
   - Verify endpoint status

2. **Isolation:**
   - Identify affected endpoints
   - Isolate problematic components
   - Fall back to healthy endpoints

3. **Recovery:**
   - Restart the router if needed
   - Rebalance load
   - Restore from backup if necessary
## Examples

See the examples directory for monitoring configuration examples.