trustgraph/docs/tech-specs/logging-strategy.md
cybermaggedon f12fcc2652
Loki logging (#586)
* Consolidate logging into a single module

* Added Loki logging

* Update tech spec

* Add processor label

* Fix recursive log entries, logging Loki"s internals
2025-12-09 23:24:41 +00:00

9.8 KiB

TrustGraph Logging Strategy

Overview

TrustGraph uses Python's built-in logging module for all logging operations, with centralized configuration and optional Loki integration for log aggregation. This provides a standardized, flexible approach to logging across all components of the system.

Default Configuration

Logging Level

  • Default Level: INFO
  • Configurable via: --log-level command-line argument
  • Choices: DEBUG, INFO, WARNING, ERROR, CRITICAL

Output Destinations

  1. Console (stdout): Always enabled - ensures compatibility with containerized environments
  2. Loki: Optional centralized log aggregation (enabled by default, can be disabled)

Centralized Logging Module

All logging configuration is managed by trustgraph.base.logging module, which provides:

  • add_logging_args(parser) - Adds standard logging CLI arguments
  • setup_logging(args) - Configures logging from parsed arguments

This module is used by all server-side components:

  • AsyncProcessor-based services
  • API Gateway
  • MCP Server

Implementation Guidelines

1. Logger Initialization

Each module should create its own logger using the module's __name__:

import logging

logger = logging.getLogger(__name__)

The logger name is automatically used as a label in Loki for filtering and searching.

2. Service Initialization

All server-side services automatically get logging configuration through the centralized module:

from trustgraph.base import add_logging_args, setup_logging
import argparse

def main():
    parser = argparse.ArgumentParser()

    # Add standard logging arguments (includes Loki configuration)
    add_logging_args(parser)

    # Add your service-specific arguments
    parser.add_argument('--port', type=int, default=8080)

    args = parser.parse_args()
    args = vars(args)

    # Setup logging early in startup
    setup_logging(args)

    # Rest of your service initialization
    logger = logging.getLogger(__name__)
    logger.info("Service starting...")

3. Command-Line Arguments

All services support these logging arguments:

Log Level:

--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}

Loki Configuration:

--loki-enabled              # Enable Loki (default)
--no-loki-enabled           # Disable Loki
--loki-url URL              # Loki push URL (default: http://loki:3100/loki/api/v1/push)
--loki-username USERNAME    # Optional authentication
--loki-password PASSWORD    # Optional authentication

Examples:

# Default - INFO level, Loki enabled
./my-service

# Debug mode, console only
./my-service --log-level DEBUG --no-loki-enabled

# Custom Loki server with auth
./my-service --loki-url http://loki.prod:3100/loki/api/v1/push \
             --loki-username admin --loki-password secret

4. Environment Variables

Loki configuration supports environment variable fallbacks:

export LOKI_URL=http://loki.prod:3100/loki/api/v1/push
export LOKI_USERNAME=admin
export LOKI_PASSWORD=secret

Command-line arguments take precedence over environment variables.

5. Logging Best Practices

Log Levels Usage

  • DEBUG: Detailed information for diagnosing problems (variable values, function entry/exit)
  • INFO: General informational messages (service started, configuration loaded, processing milestones)
  • WARNING: Warning messages for potentially harmful situations (deprecated features, recoverable errors)
  • ERROR: Error messages for serious problems (failed operations, exceptions)
  • CRITICAL: Critical messages for system failures requiring immediate attention

Message Format

# Good - includes context
logger.info(f"Processing document: {doc_id}, size: {doc_size} bytes")
logger.error(f"Failed to connect to database: {error}", exc_info=True)

# Avoid - lacks context
logger.info("Processing document")
logger.error("Connection failed")

Performance Considerations

# Use lazy formatting for expensive operations
logger.debug("Expensive operation result: %s", expensive_function())

# Check log level for very expensive debug operations
if logger.isEnabledFor(logging.DEBUG):
    debug_data = compute_expensive_debug_info()
    logger.debug(f"Debug data: {debug_data}")

6. Structured Logging with Loki

For complex data, use structured logging with extra tags for Loki:

logger.info("Request processed", extra={
    'tags': {
        'request_id': request_id,
        'user_id': user_id,
        'status': 'success'
    }
})

These tags become searchable labels in Loki, in addition to automatic labels:

  • severity - Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • logger - Module name (from __name__)

7. Exception Logging

Always include stack traces for exceptions:

try:
    process_data()
except Exception as e:
    logger.error(f"Failed to process data: {e}", exc_info=True)
    raise

8. Async Logging Considerations

The logging system uses non-blocking queued handlers for Loki:

  • Console output is synchronous (fast)
  • Loki output is queued with 500-message buffer
  • Background thread handles Loki transmission
  • No blocking of main application code
import asyncio
import logging

async def async_operation():
    logger = logging.getLogger(__name__)
    # Logging is thread-safe and won't block async operations
    logger.info(f"Starting async operation in task: {asyncio.current_task().get_name()}")

Loki Integration

Architecture

The logging system uses Python's built-in QueueHandler and QueueListener for non-blocking Loki integration:

  1. QueueHandler: Logs are placed in a 500-message queue (non-blocking)
  2. Background Thread: QueueListener sends logs to Loki asynchronously
  3. Graceful Degradation: If Loki is unavailable, console logging continues

Automatic Labels

Every log sent to Loki includes:

  • processor: Processor identity (e.g., config-svc, text-completion, embeddings)
  • severity: Log level (DEBUG, INFO, etc.)
  • logger: Module name (e.g., trustgraph.gateway.service, trustgraph.agent.react.service)

Custom Labels

Add custom labels via the extra parameter:

logger.info("User action", extra={
    'tags': {
        'user_id': user_id,
        'action': 'document_upload',
        'collection': collection_name
    }
})

Querying Logs in Loki

# All logs from a specific processor (recommended - matches Prometheus metrics)
{processor="config-svc"}
{processor="text-completion"}
{processor="embeddings"}

# Error logs from a specific processor
{processor="config-svc", severity="ERROR"}

# Error logs from all processors
{severity="ERROR"}

# Logs from a specific processor with text filter
{processor="text-completion"} |= "Processing"

# All logs from API gateway
{processor="api-gateway"}

# Logs from processors matching pattern
{processor=~".*-completion"}

# Logs with custom tags
{processor="api-gateway"} | json | user_id="12345"

Graceful Degradation

If Loki is unavailable or python-logging-loki is not installed:

  • Warning message printed to console
  • Console logging continues normally
  • Application continues running
  • No retry logic for Loki connection (fail fast, degrade gracefully)

Testing

During tests, consider using a different logging configuration:

# In test setup
import logging

# Reduce noise during tests
logging.getLogger().setLevel(logging.WARNING)

# Or disable Loki for tests
setup_logging({'log_level': 'WARNING', 'loki_enabled': False})

Monitoring Integration

Standard Format

All logs use consistent format:

2025-01-09 10:30:45,123 - trustgraph.gateway.service - INFO - Request processed

Format components:

  • Timestamp (ISO format with milliseconds)
  • Logger name (module path)
  • Log level
  • Message

Loki Queries for Monitoring

Common monitoring queries:

# Error rate by processor
rate({severity="ERROR"}[5m]) by (processor)

# Top error-producing processors
topk(5, count_over_time({severity="ERROR"}[1h]) by (processor))

# Recent errors with processor name
{severity="ERROR"} | line_format "{{.processor}}: {{.message}}"

# All agent processors
{processor=~".*agent.*"} |= "exception"

# Specific processor error count
count_over_time({processor="config-svc", severity="ERROR"}[1h])

Security Considerations

  • Never log sensitive information (passwords, API keys, personal data, tokens)
  • Sanitize user input before logging
  • Use placeholders for sensitive fields: user_id=****1234
  • Loki authentication: Use --loki-username and --loki-password for secure deployments
  • Secure transport: Use HTTPS for Loki URL in production: https://loki.prod:3100/loki/api/v1/push

Dependencies

The centralized logging module requires:

  • python-logging-loki - For Loki integration (optional, graceful degradation if missing)

Already included in trustgraph-base/pyproject.toml and requirements.txt.

Migration Path

For existing code:

  1. Services already using AsyncProcessor: No changes needed, Loki support is automatic
  2. Services not using AsyncProcessor (api-gateway, mcp-server): Already updated
  3. CLI tools: Out of scope - continue using print() or simple logging

From print() to logging:

# Before
print(f"Processing document {doc_id}")

# After
logger = logging.getLogger(__name__)
logger.info(f"Processing document {doc_id}")

Configuration Summary

Argument Default Environment Variable Description
--log-level INFO - Console and Loki log level
--loki-enabled True - Enable Loki logging
--loki-url http://loki:3100/loki/api/v1/push LOKI_URL Loki push endpoint
--loki-username None LOKI_USERNAME Loki auth username
--loki-password None LOKI_PASSWORD Loki auth password