Tech Spec: Cassandra Configuration Consolidation
Status: Draft
Author: Assistant
Date: 2024-09-03
Overview
This specification addresses the inconsistent naming and configuration patterns for Cassandra connection parameters across the TrustGraph codebase. Currently, two different parameter naming schemes exist (cassandra_* vs graph_*), leading to confusion and maintenance complexity.
Problem Statement
The codebase currently uses two distinct sets of Cassandra configuration parameters:
- Knowledge/Config/Library modules use:
  - cassandra_host (list of hosts)
  - cassandra_user
  - cassandra_password
- Graph/Storage modules use:
  - graph_host (single host, sometimes converted to list)
  - graph_username
  - graph_password
- Inconsistent command-line exposure:
  - Some processors (e.g., kg-store) don't expose Cassandra settings as command-line arguments
  - Other processors expose them with different names and formats
  - Help text doesn't reflect environment variable defaults
Both parameter sets connect to the same Cassandra cluster but with different naming conventions, causing:
- Configuration confusion for users
- Increased maintenance burden
- Inconsistent documentation
- Potential for misconfiguration
- Inability to override settings via command-line in some processors
Proposed Solution
1. Standardize Parameter Names
All modules will use consistent cassandra_* parameter names:
- cassandra_host - List of hosts (internally stored as list)
- cassandra_username - Username for authentication
- cassandra_password - Password for authentication
2. Command-Line Arguments
All processors MUST expose Cassandra configuration via command-line arguments:
- --cassandra-host - Comma-separated list of hosts
- --cassandra-username - Username for authentication
- --cassandra-password - Password for authentication
3. Environment Variable Fallback
If command-line parameters are not explicitly provided, the system will check environment variables:
- CASSANDRA_HOST - Comma-separated list of hosts
- CASSANDRA_USERNAME - Username for authentication
- CASSANDRA_PASSWORD - Password for authentication
4. Default Values
If neither command-line parameters nor environment variables are specified:
- cassandra_host defaults to ["cassandra"]
- cassandra_username defaults to None (no authentication)
- cassandra_password defaults to None (no authentication)
5. Help Text Requirements
The --help output must:
- Show environment variable values as defaults when set
- Never display password values (show **** or <set> instead)
- Clearly indicate the resolution order in help text
Example help output:
    --cassandra-host HOST
        Cassandra host list, comma-separated (default: prod-cluster-1,prod-cluster-2)
        [from CASSANDRA_HOST environment variable]
    --cassandra-username USERNAME
        Cassandra username (default: cassandra_user)
        [from CASSANDRA_USERNAME environment variable]
    --cassandra-password PASSWORD
        Cassandra password (default: <set from environment>)
Implementation Details
Parameter Resolution Order
For each Cassandra parameter, the resolution order will be:
1. Command-line argument value
2. Environment variable (CASSANDRA_*)
3. Default value
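The resolution order above can be sketched as a single function. This is a minimal illustration, not the project's implementation: the name resolve_param and its environ argument are hypothetical, and treating an empty environment variable as unset is an assumption the spec does not state.

```python
import os

# Hypothetical helper; the spec defines the order but not this function.
def resolve_param(cli_value, env_name, default=None, environ=None):
    """Return the first value that is set: CLI argument, environment, default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    env_value = environ.get(env_name)
    if env_value:  # assumption: an empty variable counts as unset
        return env_value
    return default

# A plain dict stands in for os.environ so the example is deterministic.
env = {"CASSANDRA_HOST": "env-host-1,env-host-2"}
from_cli = resolve_param("cli-host", "CASSANDRA_HOST", "cassandra", environ=env)
from_env = resolve_param(None, "CASSANDRA_HOST", "cassandra", environ=env)
from_default = resolve_param(None, "CASSANDRA_HOST", "cassandra", environ={})
```

Passing the environment as an argument keeps the function easy to test without modifying the real process environment.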
Host Parameter Handling
The cassandra_host parameter:
- Command-line accepts comma-separated string: --cassandra-host "host1,host2,host3"
- Environment variable accepts comma-separated string: CASSANDRA_HOST="host1,host2,host3"
- Internally always stored as list: ["host1", "host2", "host3"]
- Single host: "localhost" → converted to ["localhost"]
- Already a list: ["host1", "host2"] → used as-is
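The host handling rules above could be implemented along these lines. The function name normalize_hosts is hypothetical, and two behaviours go beyond what the spec states and should be read as assumptions: whitespace around entries is stripped, and empty entries (for example from a trailing comma) are dropped.

```python
from typing import Iterable, List, Optional, Union

# Hypothetical helper name; behaviour follows the "Host Parameter Handling" rules.
def normalize_hosts(value: Optional[Union[str, Iterable[str]]]) -> List[str]:
    """Return a list of host names from a comma-separated string or an iterable."""
    if value is None:
        return ["cassandra"]  # default from the "Default Values" section
    if isinstance(value, str):
        parts = value.split(",")
    else:
        parts = list(value)
    # Assumption: strip whitespace and drop empty entries such as "host1,,host2"
    return [p.strip() for p in parts if p and p.strip()]

hosts_from_string = normalize_hosts("host1, host2,host3")
hosts_from_single = normalize_hosts("localhost")
hosts_from_list = normalize_hosts(["host1", "host2"])
```

Note that a list input is copied rather than returned as the same object, so later mutation by the caller does not affect the stored configuration.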
Authentication Logic
Authentication will be used when both cassandra_username and cassandra_password are provided:
    if cassandra_username and cassandra_password:
        # Use SSL context and PlainTextAuthProvider
        ...
    else:
        # Connect without authentication
        ...
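A fuller sketch of this branch, assuming the DataStax cassandra-driver package: PlainTextAuthProvider and the auth_provider and ssl_context arguments of cassandra.cluster.Cluster come from that driver. The function name build_cluster_kwargs, the use of default TLS settings, and the warning for a half-configured credential pair are illustrative assumptions, not part of the spec.

```python
import ssl
import warnings

# Hypothetical helper: builds keyword arguments for cassandra.cluster.Cluster.
def build_cluster_kwargs(hosts, username=None, password=None):
    kwargs = {"contact_points": list(hosts)}
    if username and password:
        # Imported lazily so the unauthenticated path does not need the driver here.
        from cassandra.auth import PlainTextAuthProvider
        kwargs["auth_provider"] = PlainTextAuthProvider(
            username=username, password=password
        )
        # Assumption: default TLS settings; deployments may need custom CA files.
        kwargs["ssl_context"] = ssl.create_default_context()
    elif username or password:
        # Not in the spec: flag a half-configured credential pair.
        warnings.warn(
            "Only one of cassandra_username/cassandra_password is set; "
            "connecting without authentication"
        )
    return kwargs

no_auth_kwargs = build_cluster_kwargs(["cassandra"])
```

With the driver installed, the result would be passed on as Cluster(**build_cluster_kwargs(hosts, username, password)).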
Files to Modify
Modules using graph_* parameters (to be changed):
- trustgraph-flow/trustgraph/storage/triples/cassandra/write.py
- trustgraph-flow/trustgraph/storage/objects/cassandra/write.py
- trustgraph-flow/trustgraph/storage/rows/cassandra/write.py
- trustgraph-flow/trustgraph/query/triples/cassandra/service.py
Modules using cassandra_* parameters (to be updated with env fallback):
- trustgraph-flow/trustgraph/tables/config.py
- trustgraph-flow/trustgraph/tables/knowledge.py
- trustgraph-flow/trustgraph/tables/library.py
- trustgraph-flow/trustgraph/storage/knowledge/store.py
- trustgraph-flow/trustgraph/cores/knowledge.py
- trustgraph-flow/trustgraph/librarian/librarian.py
- trustgraph-flow/trustgraph/librarian/service.py
- trustgraph-flow/trustgraph/config/service/service.py
- trustgraph-flow/trustgraph/cores/service.py
Test Files to Update:
- tests/unit/test_cores/test_knowledge_manager.py
- tests/unit/test_storage/test_triples_cassandra_storage.py
- tests/unit/test_query/test_triples_cassandra_query.py
- tests/integration/test_objects_cassandra_integration.py
Implementation Strategy
Phase 1: Create Common Configuration Helper
Create utility functions to standardize Cassandra configuration across all processors:
    import os
    import argparse

    def get_cassandra_defaults():
        """Get default values from environment variables or fallback."""
        return {
            'host': os.getenv('CASSANDRA_HOST', 'cassandra'),
            'username': os.getenv('CASSANDRA_USERNAME'),
            'password': os.getenv('CASSANDRA_PASSWORD')
        }

    def add_cassandra_args(parser: argparse.ArgumentParser):
        """
        Add standardized Cassandra arguments to an argument parser.
        Shows environment variable values in help text.
        """
        defaults = get_cassandra_defaults()

        # Format help text with env var indication
        host_help = f"Cassandra host list, comma-separated (default: {defaults['host']})"
        if 'CASSANDRA_HOST' in os.environ:
            host_help += " [from CASSANDRA_HOST]"

        username_help = "Cassandra username"
        if defaults['username']:
            username_help += f" (default: {defaults['username']})"
        if 'CASSANDRA_USERNAME' in os.environ:
            username_help += " [from CASSANDRA_USERNAME]"

        # Never echo the password itself; only indicate that one is set
        password_help = "Cassandra password"
        if defaults['password']:
            password_help += " (default: <set>)"
        if 'CASSANDRA_PASSWORD' in os.environ:
            password_help += " [from CASSANDRA_PASSWORD]"

        parser.add_argument(
            '--cassandra-host',
            default=defaults['host'],
            help=host_help
        )
        parser.add_argument(
            '--cassandra-username',
            default=defaults['username'],
            help=username_help
        )
        parser.add_argument(
            '--cassandra-password',
            default=defaults['password'],
            help=password_help
        )

    def resolve_cassandra_config(args) -> tuple[list[str], str | None, str | None]:
        """
        Convert argparse args to Cassandra configuration.
        Returns:
            tuple: (hosts_list, username, password)
        """
        # Convert host string to list, skipping empty entries (e.g. trailing comma)
        if isinstance(args.cassandra_host, str):
            hosts = [h.strip() for h in args.cassandra_host.split(',') if h.strip()]
        else:
            hosts = args.cassandra_host
        return hosts, args.cassandra_username, args.cassandra_password
Phase 2: Update Modules Using graph_* Parameters
- Change parameter names from graph_* to cassandra_*
- Replace custom add_args() methods with standardized add_cassandra_args()
- Use the common configuration helper functions
- Update documentation strings
Example transformation:
    # OLD CODE
    @staticmethod
    def add_args(parser):
        parser.add_argument(
            '-g', '--graph-host',
            default="localhost",
            help=f'Graph host (default: localhost)'
        )
        parser.add_argument(
            '--graph-username',
            default=None,
            help=f'Cassandra username'
        )

    # NEW CODE
    @staticmethod
    def add_args(parser):
        FlowProcessor.add_args(parser)
        add_cassandra_args(parser)  # Use standard helper
Phase 3: Update Modules Using cassandra_* Parameters
- Add command-line argument support where missing (e.g., kg-store)
- Replace existing argument definitions with add_cassandra_args()
- Use resolve_cassandra_config() for consistent resolution
- Ensure consistent host list handling
Phase 4: Update Tests and Documentation
- Update all test files to use new parameter names
- Update CLI documentation
- Update API documentation
- Add environment variable documentation
Backward Compatibility
To maintain backward compatibility during transition:
- Deprecation warnings for graph_* parameters
- Parameter aliasing - accept both old and new names initially
- Phased rollout over multiple releases
- Documentation updates with migration guide
Example backward compatibility code:
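    import warnings

    def __init__(self, **params):
        # Handle deprecated graph_* parameters; an explicit cassandra_* value wins
        if 'graph_host' in params:
            warnings.warn("graph_host is deprecated, use cassandra_host", DeprecationWarning)
            params.setdefault('cassandra_host', params.pop('graph_host'))
        if 'graph_username' in params:
            warnings.warn("graph_username is deprecated, use cassandra_username", DeprecationWarning)
            params.setdefault('cassandra_username', params.pop('graph_username'))
        if 'graph_password' in params:
            warnings.warn("graph_password is deprecated, use cassandra_password", DeprecationWarning)
            params.setdefault('cassandra_password', params.pop('graph_password'))
        # ... continue with standard resolution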
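The same aliasing idea can be applied at the command-line level, so that an old flag such as --graph-host keeps working during the transition. The sketch below uses only standard argparse features (a custom Action, dest, and argparse.SUPPRESS). The class name DeprecatedAlias and its new_flag argument are hypothetical, and the spec does not say whether the old flags should remain visible in --help.

```python
import argparse
import warnings

class DeprecatedAlias(argparse.Action):
    """Store the value under the new destination and warn about the old flag."""

    def __init__(self, option_strings, dest, new_flag=None, **kwargs):
        self.new_flag = new_flag  # hypothetical extra argument, used in the message
        super().__init__(option_strings, dest, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated, use {self.new_flag}",
            DeprecationWarning,
            stacklevel=2,
        )
        setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser()
parser.add_argument("--cassandra-host", default="cassandra")
# The old flag writes to the same destination. SUPPRESS as the default means it
# never supplies a value of its own; SUPPRESS as help hides it from --help.
parser.add_argument(
    "--graph-host",
    dest="cassandra_host",
    action=DeprecatedAlias,
    new_flag="--cassandra-host",
    default=argparse.SUPPRESS,
    help=argparse.SUPPRESS,
)
```

If both flags are given on one command line, the one that appears last wins, because both write to the same destination.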
Testing Strategy
- Unit tests for configuration resolution logic
- Integration tests with various configuration combinations
- Environment variable tests
- Backward compatibility tests with deprecated parameters
- Docker compose tests with environment variables
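As an illustration of the environment variable tests, the sketch below uses the standard library's unittest and unittest.mock.patch.dict to run each case against a controlled environment. The function cassandra_defaults is a stand-in that mirrors get_cassandra_defaults() from Phase 1 so that the example is self-contained. The test class name and the choice of unittest are assumptions; the project's own test runner may differ.

```python
import os
import unittest
from unittest import mock

def cassandra_defaults():
    """Stand-in for the helper under test (mirrors get_cassandra_defaults)."""
    return {
        "host": os.getenv("CASSANDRA_HOST", "cassandra"),
        "username": os.getenv("CASSANDRA_USERNAME"),
        "password": os.getenv("CASSANDRA_PASSWORD"),
    }

class CassandraDefaultsTest(unittest.TestCase):
    def test_env_values_are_used(self):
        env = {
            "CASSANDRA_HOST": "h1,h2",
            "CASSANDRA_USERNAME": "user",
            "CASSANDRA_PASSWORD": "secret",
        }
        # clear=True removes every other variable for the duration of the block
        with mock.patch.dict(os.environ, env, clear=True):
            self.assertEqual(
                cassandra_defaults(),
                {"host": "h1,h2", "username": "user", "password": "secret"},
            )

    def test_fallback_without_env(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(
                cassandra_defaults(),
                {"host": "cassandra", "username": None, "password": None},
            )
```

Saved as a test module, this can be run with python -m unittest. patch.dict restores the original environment when each block exits, so the tests do not leak settings into one another.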
Documentation Updates
- Update all CLI command documentation
- Update API documentation
- Create migration guide
- Update Docker compose examples
- Update configuration reference documentation
Risks and Mitigation
| Risk | Impact | Mitigation |
|---|---|---|
| Breaking changes for users | High | Implement backward compatibility period |
| Configuration confusion during transition | Medium | Clear documentation and deprecation warnings |
| Test failures | Medium | Comprehensive test updates |
| Docker deployment issues | High | Update all Docker compose examples |
Success Criteria
- All modules use consistent cassandra_* parameter names
- All processors expose Cassandra settings via command-line arguments
- Command-line help text shows environment variable defaults
- Password values are never displayed in help text
- Environment variable fallback works correctly
- cassandra_host is consistently handled as a list internally
- Backward compatibility maintained for at least 2 releases
- All tests pass with new configuration system
- Documentation fully updated
- Docker compose examples work with environment variables
Timeline
- Week 1: Implement common configuration helper and update graph_* modules
- Week 2: Add environment variable support to existing cassandra_* modules
- Week 3: Update tests and documentation
- Week 4: Integration testing and bug fixes
Future Considerations
- Consider extending this pattern to other database configurations (e.g., Elasticsearch)
- Implement configuration validation and better error messages
- Add support for Cassandra connection pooling configuration
- Consider adding configuration file support (.env files)