trustgraph/docs/tech-specs/cassandra-consolidation.md
Alex Jenkins 8954fa3ad7 Feat: TrustGraph i18n & Documentation Translation Updates (#781)
Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.

Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.
2026-04-14 12:08:32 +01:00

12 KiB

layout title parent
default Tech Spec: Cassandra Configuration Consolidation Tech Specs

Tech Spec: Cassandra Configuration Consolidation

Status: Draft
Author: Assistant
Date: 2024-09-03

Overview

This specification addresses the inconsistent naming and configuration patterns for Cassandra connection parameters across the TrustGraph codebase. Currently, two different parameter naming schemes exist (cassandra_* vs graph_*), leading to confusion and maintenance complexity.

Problem Statement

The codebase currently uses two distinct sets of Cassandra configuration parameters:

  1. Knowledge/Config/Library modules use:

    • cassandra_host (list of hosts)
    • cassandra_user
    • cassandra_password
  2. Graph/Storage modules use:

    • graph_host (single host, sometimes converted to list)
    • graph_username
    • graph_password
  3. Inconsistent command-line exposure:

    • Some processors (e.g., kg-store) don't expose Cassandra settings as command-line arguments
    • Other processors expose them with different names and formats
    • Help text doesn't reflect environment variable defaults

Both parameter sets connect to the same Cassandra cluster but with different naming conventions, causing:

  • Configuration confusion for users
  • Increased maintenance burden
  • Inconsistent documentation
  • Potential for misconfiguration
  • Inability to override settings via command-line in some processors

Proposed Solution

1. Standardize Parameter Names

All modules will use consistent cassandra_* parameter names:

  • cassandra_host - List of hosts (internally stored as list)
  • cassandra_username - Username for authentication
  • cassandra_password - Password for authentication

2. Command-Line Arguments

All processors MUST expose Cassandra configuration via command-line arguments:

  • --cassandra-host - Comma-separated list of hosts
  • --cassandra-username - Username for authentication
  • --cassandra-password - Password for authentication

3. Environment Variable Fallback

If command-line parameters are not explicitly provided, the system will check environment variables:

  • CASSANDRA_HOST - Comma-separated list of hosts
  • CASSANDRA_USERNAME - Username for authentication
  • CASSANDRA_PASSWORD - Password for authentication

4. Default Values

If neither command-line parameters nor environment variables are specified:

  • cassandra_host defaults to ["cassandra"]
  • cassandra_username defaults to None (no authentication)
  • cassandra_password defaults to None (no authentication)

5. Help Text Requirements

The --help output must:

  • Show environment variable values as defaults when set
  • Never display password values (show **** or <set> instead)
  • Clearly indicate the resolution order in help text

Example help output:

--cassandra-host HOST
    Cassandra host list, comma-separated (default: prod-cluster-1,prod-cluster-2)
    [from CASSANDRA_HOST environment variable]

--cassandra-username USERNAME
    Cassandra username (default: cassandra_user)
    [from CASSANDRA_USERNAME environment variable]
    
--cassandra-password PASSWORD  
    Cassandra password (default: <set from environment>)

Implementation Details

Parameter Resolution Order

For each Cassandra parameter, the resolution order will be:

  1. Command-line argument value
  2. Environment variable (CASSANDRA_*)
  3. Default value

Host Parameter Handling

The cassandra_host parameter:

  • Command-line accepts comma-separated string: --cassandra-host "host1,host2,host3"
  • Environment variable accepts comma-separated string: CASSANDRA_HOST="host1,host2,host3"
  • Internally always stored as list: ["host1", "host2", "host3"]
  • Single host: "localhost" → converted to ["localhost"]
  • Already a list: ["host1", "host2"] → used as-is

Authentication Logic

Authentication will be used when both cassandra_username and cassandra_password are provided:

if cassandra_username and cassandra_password:
    # Use SSL context and PlainTextAuthProvider
else:
    # Connect without authentication

Files to Modify

Modules using graph_* parameters (to be changed):

  • trustgraph-flow/trustgraph/storage/triples/cassandra/write.py
  • trustgraph-flow/trustgraph/storage/objects/cassandra/write.py
  • trustgraph-flow/trustgraph/storage/rows/cassandra/write.py
  • trustgraph-flow/trustgraph/query/triples/cassandra/service.py

Modules using cassandra_* parameters (to be updated with env fallback):

  • trustgraph-flow/trustgraph/tables/config.py
  • trustgraph-flow/trustgraph/tables/knowledge.py
  • trustgraph-flow/trustgraph/tables/library.py
  • trustgraph-flow/trustgraph/storage/knowledge/store.py
  • trustgraph-flow/trustgraph/cores/knowledge.py
  • trustgraph-flow/trustgraph/librarian/librarian.py
  • trustgraph-flow/trustgraph/librarian/service.py
  • trustgraph-flow/trustgraph/config/service/service.py
  • trustgraph-flow/trustgraph/cores/service.py

Test Files to Update:

  • tests/unit/test_cores/test_knowledge_manager.py
  • tests/unit/test_storage/test_triples_cassandra_storage.py
  • tests/unit/test_query/test_triples_cassandra_query.py
  • tests/integration/test_objects_cassandra_integration.py

Implementation Strategy

Phase 1: Create Common Configuration Helper

Create utility functions to standardize Cassandra configuration across all processors:

import os
import argparse

def get_cassandra_defaults():
    """Get default values from environment variables or fallback."""
    return {
        'host': os.getenv('CASSANDRA_HOST', 'cassandra'),
        'username': os.getenv('CASSANDRA_USERNAME'),
        'password': os.getenv('CASSANDRA_PASSWORD')
    }

def add_cassandra_args(parser: argparse.ArgumentParser):
    """
    Add standardized Cassandra arguments to an argument parser.
    Shows environment variable values in help text.
    """
    defaults = get_cassandra_defaults()
    
    # Format help text with env var indication
    host_help = f"Cassandra host list, comma-separated (default: {defaults['host']})"
    if 'CASSANDRA_HOST' in os.environ:
        host_help += " [from CASSANDRA_HOST]"
    
    username_help = f"Cassandra username"
    if defaults['username']:
        username_help += f" (default: {defaults['username']})"
        if 'CASSANDRA_USERNAME' in os.environ:
            username_help += " [from CASSANDRA_USERNAME]"
    
    password_help = "Cassandra password"
    if defaults['password']:
        password_help += " (default: <set>)"
        if 'CASSANDRA_PASSWORD' in os.environ:
            password_help += " [from CASSANDRA_PASSWORD]"
    
    parser.add_argument(
        '--cassandra-host',
        default=defaults['host'],
        help=host_help
    )
    
    parser.add_argument(
        '--cassandra-username',
        default=defaults['username'],
        help=username_help
    )
    
    parser.add_argument(
        '--cassandra-password',
        default=defaults['password'],
        help=password_help
    )

def resolve_cassandra_config(args) -> tuple[list[str], str|None, str|None]:
    """
    Convert argparse args to Cassandra configuration.
    
    Returns:
        tuple: (hosts_list, username, password)
    """
    # Convert host string to list
    if isinstance(args.cassandra_host, str):
        hosts = [h.strip() for h in args.cassandra_host.split(',')]
    else:
        hosts = args.cassandra_host
    
    return hosts, args.cassandra_username, args.cassandra_password

Phase 2: Update Modules Using graph_* Parameters

  1. Change parameter names from graph_* to cassandra_*
  2. Replace custom add_args() methods with standardized add_cassandra_args()
  3. Use the common configuration helper functions
  4. Update documentation strings

Example transformation:

# OLD CODE
@staticmethod
def add_args(parser):
    parser.add_argument(
        '-g', '--graph-host',
        default="localhost",
        help=f'Graph host (default: localhost)'
    )
    parser.add_argument(
        '--graph-username',
        default=None,
        help=f'Cassandra username'
    )

# NEW CODE  
@staticmethod
def add_args(parser):
    FlowProcessor.add_args(parser)
    add_cassandra_args(parser)  # Use standard helper

Phase 3: Update Modules Using cassandra_* Parameters

  1. Add command-line argument support where missing (e.g., kg-store)
  2. Replace existing argument definitions with add_cassandra_args()
  3. Use resolve_cassandra_config() for consistent resolution
  4. Ensure consistent host list handling

Phase 4: Update Tests and Documentation

  1. Update all test files to use new parameter names
  2. Update CLI documentation
  3. Update API documentation
  4. Add environment variable documentation

Backward Compatibility

To maintain backward compatibility during transition:

  1. Deprecation warnings for graph_* parameters
  2. Parameter aliasing - accept both old and new names initially
  3. Phased rollout over multiple releases
  4. Documentation updates with migration guide

Example backward compatibility code:

def __init__(self, **params):
    # Handle deprecated graph_* parameters
    if 'graph_host' in params:
        warnings.warn("graph_host is deprecated, use cassandra_host", DeprecationWarning)
        params.setdefault('cassandra_host', params.pop('graph_host'))
    
    if 'graph_username' in params:
        warnings.warn("graph_username is deprecated, use cassandra_username", DeprecationWarning)
        params.setdefault('cassandra_username', params.pop('graph_username'))
    
    # ... continue with standard resolution

Testing Strategy

  1. Unit tests for configuration resolution logic
  2. Integration tests with various configuration combinations
  3. Environment variable tests
  4. Backward compatibility tests with deprecated parameters
  5. Docker compose tests with environment variables

Documentation Updates

  1. Update all CLI command documentation
  2. Update API documentation
  3. Create migration guide
  4. Update Docker compose examples
  5. Update configuration reference documentation

Risks and Mitigation

Risk Impact Mitigation
Breaking changes for users High Implement backward compatibility period
Configuration confusion during transition Medium Clear documentation and deprecation warnings
Test failures Medium Comprehensive test updates
Docker deployment issues High Update all Docker compose examples

Success Criteria

  • All modules use consistent cassandra_* parameter names
  • All processors expose Cassandra settings via command-line arguments
  • Command-line help text shows environment variable defaults
  • Password values are never displayed in help text
  • Environment variable fallback works correctly
  • cassandra_host is consistently handled as a list internally
  • Backward compatibility maintained for at least 2 releases
  • All tests pass with new configuration system
  • Documentation fully updated
  • Docker compose examples work with environment variables

Timeline

  • Week 1: Implement common configuration helper and update graph_* modules
  • Week 2: Add environment variable support to existing cassandra_* modules
  • Week 3: Update tests and documentation
  • Week 4: Integration testing and bug fixes

Future Considerations

  • Consider extending this pattern to other database configurations (e.g., Elasticsearch)
  • Implement configuration validation and better error messages
  • Add support for Cassandra connection pooling configuration
  • Consider adding configuration file support (.env files)