# Tech Spec: Cassandra Knowledge Base Performance Refactor

**Status:** Draft
**Author:** Assistant
**Date:** 2025-09-18

## Overview

This specification addresses performance issues in the TrustGraph Cassandra knowledge base implementation and proposes optimizations for RDF triple storage and querying.

## Current Implementation

### Schema Design

The current implementation uses a single-table design in `trustgraph-flow/trustgraph/direct/cassandra_kg.py`:

```sql
CREATE TABLE triples (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY (collection, s, p, o)
);
```

**Secondary Indexes:**

- `triples_s` ON `s` (subject)
- `triples_p` ON `p` (predicate)
- `triples_o` ON `o` (object)

### Query Patterns

The current implementation supports 8 distinct query patterns:

1. **get_all(collection, limit=50)** - Retrieve all triples for a collection

   ```sql
   SELECT s, p, o FROM triples WHERE collection = ? LIMIT 50
   ```

2. **get_s(collection, s, limit=10)** - Query by subject

   ```sql
   SELECT p, o FROM triples WHERE collection = ? AND s = ? LIMIT 10
   ```

3. **get_p(collection, p, limit=10)** - Query by predicate

   ```sql
   SELECT s, o FROM triples WHERE collection = ? AND p = ? LIMIT 10
   ```

4. **get_o(collection, o, limit=10)** - Query by object

   ```sql
   SELECT s, p FROM triples WHERE collection = ? AND o = ? LIMIT 10
   ```

5. **get_sp(collection, s, p, limit=10)** - Query by subject + predicate

   ```sql
   SELECT o FROM triples WHERE collection = ? AND s = ? AND p = ? LIMIT 10
   ```

6. **get_po(collection, p, o, limit=10)** - Query by predicate + object ⚠️

   ```sql
   SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
   ```

7. **get_os(collection, o, s, limit=10)** - Query by object + subject ⚠️

   ```sql
   SELECT p FROM triples WHERE collection = ? AND o = ? AND s = ? LIMIT 10 ALLOW FILTERING
   ```

8. **get_spo(collection, s, p, o, limit=10)** - Exact triple match

   ```sql
   SELECT s AS x FROM triples WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT 10
   ```
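The eight patterns above can be sketched against a plain in-memory store. `get_po` and `get_os` are the two patterns with no direct access path: they must examine every row, which is essentially what `ALLOW FILTERING` does inside Cassandra. This is a simplified, hypothetical model for illustration only; the class and method names are not from the codebase:

```python
# Minimal in-memory model of the current single-table design.
# Rows live in one list per collection. Lookups that lead with the
# subject (the first clustering column) can short-circuit; get_po and
# get_os have no such prefix and must scan everything, mirroring the
# cost of ALLOW FILTERING.
from collections import defaultdict

class SingleTableKG:
    def __init__(self):
        self.rows = defaultdict(list)  # collection -> [(s, p, o), ...]

    def insert(self, collection, s, p, o):
        self.rows[collection].append((s, p, o))

    def get_sp(self, collection, s, p, limit=10):
        # Prefix of the clustering key: efficient in Cassandra.
        return [o for (s_, p_, o) in self.rows[collection]
                if s_ == s and p_ == p][:limit]

    def get_po(self, collection, p, o, limit=10):
        # No access path leads with (p, o): full scan, i.e. the
        # in-memory analogue of ALLOW FILTERING.
        return [s for (s, p_, o_) in self.rows[collection]
                if p_ == p and o_ == o][:limit]

kg = SingleTableKG()
kg.insert("c", "alice", "knows", "bob")
kg.insert("c", "carol", "knows", "bob")
kg.insert("c", "alice", "likes", "tea")
print(kg.get_sp("c", "alice", "knows"))  # ['bob']
print(kg.get_po("c", "knows", "bob"))    # ['alice', 'carol']
```

The scan in `get_po` grows with the number of triples in the collection, which is the linear degradation described below.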
### Current Architecture

**File: `trustgraph-flow/trustgraph/direct/cassandra_kg.py`**

- Single `KnowledgeGraph` class handling all operations
- Connection pooling through a global `_active_clusters` list
- Fixed table name: `"triples"`
- Keyspace-per-user model
- SimpleStrategy replication with factor 1

**Integration Points:**

- **Write Path:** `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- **Query Path:** `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`
- **Knowledge Store:** `trustgraph-flow/trustgraph/tables/knowledge.py`

## Performance Issues Identified

### Schema-Level Issues

1. **Inefficient Primary Key Design**
   - Current: `PRIMARY KEY (collection, s, p, o)`
   - Results in poor clustering for common access patterns
   - Forces expensive secondary index usage

2. **Secondary Index Overuse** ⚠️
   - Three secondary indexes on high-cardinality columns (s, p, o)
   - Secondary indexes in Cassandra are expensive and don't scale well
   - Queries 6 and 7 require `ALLOW FILTERING`, indicating poor data modeling

3. **Hot Partition Risk**
   - The single partition key `collection` can create hot partitions
   - Large collections concentrate on single nodes
   - No distribution strategy for load balancing

### Query-Level Issues

1. **ALLOW FILTERING Usage** ⚠️
   - Two query types (get_po, get_os) require `ALLOW FILTERING`
   - These queries scan multiple partitions and are extremely expensive
   - Performance degrades linearly with data size

2. **Inefficient Access Patterns**
   - No optimization for common RDF query patterns
   - Missing compound indexes for frequent query combinations
   - No consideration for graph traversal patterns

3. **Lack of Query Optimization**
   - No prepared-statement caching
   - No query hints or optimization strategies
   - No pagination beyond a simple LIMIT

## Problem Statement

The current Cassandra knowledge base implementation has two critical performance bottlenecks:
### 1. Inefficient get_po Query Performance

The `get_po(collection, p, o)` query is extremely inefficient because it requires `ALLOW FILTERING`:

```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```

**Why this is problematic:**

- `ALLOW FILTERING` forces Cassandra to scan and filter all rows of the collection's partition
- Performance degrades linearly with data size
- This is a common RDF query pattern (finding subjects that have a specific predicate-object relationship)
- Creates significant load on the cluster as data grows

### 2. Poor Clustering Strategy

The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clustering benefits.

**Issues with current clustering:**

- `collection` as the sole partition key doesn't distribute data effectively
- Most collections contain diverse data, making clustering ineffective
- No consideration for common access patterns in RDF queries
- Large collections create hot partitions on single nodes
- The clustering columns (s, p, o) don't optimize for typical graph traversal patterns

**Impact:**

- Queries don't benefit from data locality
- Poor cache utilization
- Uneven load distribution across cluster nodes
- Scalability bottlenecks as collections grow

## Proposed Solution: Multi-Table Denormalization Strategy

### Overview

Replace the single `triples` table with three purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and `ALLOW FILTERING` while providing optimal performance for all query types.
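Before the concrete schema, the denormalization idea can be sketched with three in-memory maps, one per access path: each triple is written three times, and every lookup becomes a direct key access followed by an in-partition filter. This is an illustrative model only; the names are hypothetical:

```python
# In-memory sketch of the three-table denormalization: each "table"
# is a dict keyed by its partition key, so every query pattern starts
# with a direct lookup instead of a full scan.
from collections import defaultdict

class DenormalizedKG:
    def __init__(self):
        self.by_subject = defaultdict(list)  # (collection, s) -> [(p, o)]
        self.by_po = defaultdict(list)       # (collection, p) -> [(o, s)]
        self.by_object = defaultdict(list)   # (collection, o) -> [(s, p)]

    def insert(self, collection, s, p, o):
        # One logical write fans out to all three tables.
        self.by_subject[(collection, s)].append((p, o))
        self.by_po[(collection, p)].append((o, s))
        self.by_object[(collection, o)].append((s, p))

    def get_po(self, collection, p, o, limit=10):
        # Direct partition access on (collection, p); o is the first
        # clustering column, so Cassandra needs no filtering here.
        return [s for (o_, s) in self.by_po[(collection, p)]
                if o_ == o][:limit]

    def get_os(self, collection, o, s, limit=10):
        # Served by the object table: partition (collection, o),
        # then s as the leading clustering column.
        return [p for (s_, p) in self.by_object[(collection, o)]
                if s_ == s][:limit]

kg = DenormalizedKG()
kg.insert("c", "alice", "knows", "bob")
kg.insert("c", "carol", "knows", "bob")
print(kg.get_po("c", "knows", "bob"))  # ['alice', 'carol']
print(kg.get_os("c", "bob", "alice"))  # ['knows']
```

The trade-off is visible in `insert`: one logical write becomes three physical writes, which is the write-amplification and storage cost discussed under Risks.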
### New Schema Design

**Table 1: Subject-Centric Queries**

```sql
CREATE TABLE triples_by_subject (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY ((collection, s), p, o)
);
```

- **Optimizes:** get_s, get_sp, get_spo
- **Partition Key:** (collection, s) - Better distribution than collection alone
- **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject

**Table 2: Predicate-Object Queries**

```sql
CREATE TABLE triples_by_po (
    collection text,
    p text,
    o text,
    s text,
    PRIMARY KEY ((collection, p), o, s)
);
```

- **Optimizes:** get_p, get_po (eliminates ALLOW FILTERING!)
- **Partition Key:** (collection, p) - Direct access by predicate
- **Clustering:** (o, s) - Efficient object-subject traversal

**Table 3: Object-Centric Queries**

```sql
CREATE TABLE triples_by_object (
    collection text,
    o text,
    s text,
    p text,
    PRIMARY KEY ((collection, o), s, p)
);
```

- **Optimizes:** get_o, get_os
- **Partition Key:** (collection, o) - Direct access by object
- **Clustering:** (s, p) - Efficient subject-predicate traversal

### Query Mapping

| Original Query | Target Table | Performance Improvement |
|----------------|--------------|-------------------------|
| get_all(collection) | triples_by_subject | Token-based pagination |
| get_s(collection, s) | triples_by_subject | Direct partition access |
| get_p(collection, p) | triples_by_po | Direct partition access |
| get_o(collection, o) | triples_by_object | Direct partition access |
| get_sp(collection, s, p) | triples_by_subject | Partition + clustering |
| get_po(collection, p, o) | triples_by_po | **No more ALLOW FILTERING!** |
| get_os(collection, o, s) | triples_by_object | Partition + clustering |
| get_spo(collection, s, p, o) | triples_by_subject | Exact key lookup |

Note that `get_os` routes to `triples_by_object`, not `triples_by_subject`: in the subject table, restricting `o` without `p` skips the first clustering column and would reintroduce filtering, whereas `(collection, o)` with `s` as the leading clustering column serves it directly.

### Benefits

1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path
2. **No Secondary Indexes** - Each table IS the index for its query pattern
3. **Better Data Distribution** - Composite partition keys spread load effectively
4. **Predictable Performance** - Query time proportional to result size, not total data
5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture

## Implementation Plan

### Files Requiring Changes

#### Primary Implementation File

**`trustgraph-flow/trustgraph/direct/cassandra_kg.py`** - Complete rewrite required

**Current Methods to Refactor:**

```python
# Schema initialization
def init(self) -> None                            # Replace single table with three tables

# Insert operations
def insert(self, collection, s, p, o) -> None     # Write to all three tables

# Query operations (API unchanged, implementation optimized)
def get_all(self, collection, limit=50)           # Use triples_by_subject
def get_s(self, collection, s, limit=10)          # Use triples_by_subject
def get_p(self, collection, p, limit=10)          # Use triples_by_po
def get_o(self, collection, o, limit=10)          # Use triples_by_object
def get_sp(self, collection, s, p, limit=10)      # Use triples_by_subject
def get_po(self, collection, p, o, limit=10)      # Use triples_by_po (NO ALLOW FILTERING!)
def get_os(self, collection, o, s, limit=10)      # Use triples_by_object
def get_spo(self, collection, s, p, o, limit=10)  # Use triples_by_subject

# Collection management
def delete_collection(self, collection) -> None   # Delete from all three tables
```

#### Integration Files (No Logic Changes Required)

**`trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`**

- No changes needed - uses the existing KnowledgeGraph API
- Benefits automatically from the performance improvements

**`trustgraph-flow/trustgraph/query/triples/cassandra/service.py`**

- No changes needed - uses the existing KnowledgeGraph API
- Benefits automatically from the performance improvements

### Test Files Requiring Updates

#### Unit Tests

**`tests/unit/test_storage/test_triples_cassandra_storage.py`**

- Update test expectations for schema changes
- Add tests for multi-table consistency
- Verify no ALLOW FILTERING in query plans

**`tests/unit/test_query/test_triples_cassandra_query.py`**

- Update performance assertions
- Test all 8 query patterns against the new tables
- Verify query routing to the correct tables

#### Integration Tests

**`tests/integration/test_cassandra_integration.py`**

- End-to-end testing with the new schema
- Performance benchmarking comparisons
- Data consistency verification across tables

**`tests/unit/test_storage/test_cassandra_config_integration.py`**

- Update schema validation tests
- Test migration scenarios

### Implementation Strategy

#### Phase 1: Schema and Core Methods

1. **Rewrite `init()` method** - Create three tables instead of one
2. **Rewrite `insert()` method** - Batch writes to all three tables
3. **Implement prepared statements** - For optimal performance
4. **Add table routing logic** - Direct queries to the optimal tables

#### Phase 2: Query Method Optimization

1. **Rewrite each get_* method** to use its optimal table
2. **Remove all ALLOW FILTERING** usage
3. **Implement efficient clustering key usage**
4. **Add query performance logging**

#### Phase 3: Collection Management
1. **Update `delete_collection()`** - Remove from all three tables
2. **Add consistency verification** - Ensure all tables stay in sync
3. **Implement batch operations** - For atomic multi-table operations

### Key Implementation Details

#### Batch Write Strategy

```python
def insert(self, collection, s, p, o):
    # A logged batch keeps the three tables consistent with each other.
    # Prepared statements are required here: ?-style bind markers are
    # only valid for prepared statements in the Python driver.
    batch = BatchStatement()
    batch.add(self.insert_subject_stmt, (collection, s, p, o))
    batch.add(self.insert_po_stmt, (collection, p, o, s))
    batch.add(self.insert_object_stmt, (collection, o, s, p))
    self.session.execute(batch)
```

#### Query Routing Logic

```python
def get_po(self, collection, p, o, limit=10):
    # Route to the triples_by_po table - NO ALLOW FILTERING!
    # self.get_po_stmt is prepared in prepare_statements() below.
    return self.session.execute(self.get_po_stmt, (collection, p, o, limit))
```

#### Prepared Statement Optimization

```python
def prepare_statements(self):
    # Cache prepared statements for better performance
    self.insert_subject_stmt = self.session.prepare(
        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
    )
    self.insert_po_stmt = self.session.prepare(
        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
    )
    self.insert_object_stmt = self.session.prepare(
        "INSERT INTO triples_by_object (collection, o, s, p) VALUES (?, ?, ?, ?)"
    )
    self.get_po_stmt = self.session.prepare(
        "SELECT s FROM triples_by_po WHERE collection = ? AND p = ? AND o = ? LIMIT ?"
    )
    # ... etc. for the remaining tables and queries
```

## Migration Strategy

### Data Migration Approach

#### Option 1: Blue-Green Deployment (Recommended)

1. **Deploy the new schema alongside the existing one** - Use different table names temporarily
2. **Dual-write period** - Write to both old and new schemas during the transition
3. **Background migration** - Copy existing data to the new tables
4. **Switch reads** - Route queries to the new tables once data is migrated
5. **Drop old tables** - After a verification period

#### Option 2: In-Place Migration
1. **Schema addition** - Create the new tables in the existing keyspace
2. **Data migration script** - Batch copy from the old table to the new tables
3. **Application update** - Deploy the new code after migration completes
4. **Old table cleanup** - Remove the old table and its indexes

### Backward Compatibility

#### Deployment Strategy

```python
# Environment variable to control table usage during migration
USE_LEGACY_TABLES = os.getenv('CASSANDRA_USE_LEGACY', 'false').lower() == 'true'

class KnowledgeGraph:
    def __init__(self, ...):
        if USE_LEGACY_TABLES:
            self.init_legacy_schema()
        else:
            self.init_optimized_schema()
```

#### Migration Script

```python
from itertools import batched  # Python 3.12+; use a chunking helper on older versions

def migrate_data():
    # Read from the old table (the driver pages the result set automatically)
    old_triples = session.execute("SELECT collection, s, p, o FROM triples")

    # Batch write to the new tables
    for rows in batched(old_triples, 100):
        batch_stmt = BatchStatement()
        for row in rows:
            # Add to all three new tables
            batch_stmt.add(insert_subject_stmt, (row.collection, row.s, row.p, row.o))
            batch_stmt.add(insert_po_stmt, (row.collection, row.p, row.o, row.s))
            batch_stmt.add(insert_object_stmt, (row.collection, row.o, row.s, row.p))
        session.execute(batch_stmt)
```

### Validation Strategy

#### Data Consistency Checks

```python
def validate_migration(collection):
    # Count records in the old vs. new tables. Note: in the new table,
    # collection is only part of the partition key, so this count is a
    # filtered scan - acceptable for one-off validation only.
    old_count = session.execute(
        "SELECT COUNT(*) FROM triples WHERE collection = %s", (collection,)
    ).one()[0]
    new_count = session.execute(
        "SELECT COUNT(*) FROM triples_by_subject WHERE collection = %s ALLOW FILTERING",
        (collection,)
    ).one()[0]
    assert old_count == new_count, f"Record count mismatch: {old_count} vs {new_count}"

    # Spot-check random samples
    for query in generate_test_queries():
        old_result = execute_legacy_query(query)
        new_result = execute_optimized_query(query)
        assert old_result == new_result, f"Query results differ for {query}"
```

## Testing Strategy

### Performance Testing

#### Benchmark Scenarios
1. **Query Performance Comparison**
   - Before/after performance metrics for all 8 query types
   - Focus on the get_po improvement (eliminating ALLOW FILTERING)
   - Measure query latency at various data sizes

2. **Load Testing**
   - Concurrent query execution
   - Write throughput with batch operations
   - Memory and CPU utilization

3. **Scalability Testing**
   - Performance with increasing collection sizes
   - Multi-collection query distribution
   - Cluster node utilization

#### Test Data Sets

- **Small:** 10K triples per collection
- **Medium:** 100K triples per collection
- **Large:** 1M+ triples per collection
- **Multiple collections:** Test partition distribution

### Functional Testing

#### Unit Test Updates

```python
# Example test structure for the new implementation
class TestCassandraKGPerformance:

    def test_get_po_no_allow_filtering(self):
        # Verify get_po queries don't use ALLOW FILTERING
        with patch('cassandra.cluster.Session.execute') as mock_execute:
            kg.get_po('test_collection', 'predicate', 'object')
            executed_query = str(mock_execute.call_args[0][0])
            assert 'ALLOW FILTERING' not in executed_query

    def test_multi_table_consistency(self):
        # Verify all tables stay in sync
        kg.insert('test', 's1', 'p1', 'o1')
        # Check that every table contains the triple
        assert_triple_exists('triples_by_subject', 'test', 's1', 'p1', 'o1')
        assert_triple_exists('triples_by_po', 'test', 'p1', 'o1', 's1')
        assert_triple_exists('triples_by_object', 'test', 'o1', 's1', 'p1')
```

#### Integration Test Updates

```python
class TestCassandraIntegration:

    def test_query_performance_regression(self):
        # Ensure the new implementation is faster than the old one
        old_time = benchmark_legacy_get_po()
        new_time = benchmark_optimized_get_po()
        assert new_time < old_time * 0.5  # At least a 50% improvement

    def test_end_to_end_workflow(self):
        # Test the complete write -> query -> delete cycle
        # Verify no performance degradation in integration
        ...
```

### Rollback Plan

#### Quick Rollback Strategy
1. **Environment variable toggle** - Switch back to the legacy tables immediately
2. **Keep legacy tables** - Don't drop them until performance is proven
3. **Monitoring alerts** - Automated rollback triggers based on error rates/latency

#### Rollback Validation

```python
def rollback_to_legacy():
    # Set the environment variable
    os.environ['CASSANDRA_USE_LEGACY'] = 'true'

    # Restart services to pick up the change
    restart_cassandra_services()

    # Validate functionality
    run_smoke_tests()
```

## Risks and Considerations

### Performance Risks

- **Write latency increase** - 3x write operations per insert
- **Storage overhead** - 3x storage requirement
- **Batch write failures** - Need proper error handling

### Operational Risks

- **Migration complexity** - Data migration for large datasets
- **Consistency challenges** - Ensuring all tables stay synchronized
- **Monitoring gaps** - Need new metrics for multi-table operations

### Mitigation Strategies

1. **Gradual rollout** - Start with small collections
2. **Comprehensive monitoring** - Track all performance metrics
3. **Automated validation** - Continuous consistency checking
4. **Quick rollback capability** - Environment-based table selection

## Success Criteria

### Performance Improvements

- [ ] **Eliminate ALLOW FILTERING** - get_po and get_os queries run without filtering
- [ ] **Query latency reduction** - 50%+ improvement in query response times
- [ ] **Better load distribution** - No hot partitions; even load across cluster nodes
- [ ] **Scalable performance** - Query time proportional to result size, not total data

### Functional Requirements

- [ ] **API compatibility** - All existing code continues to work unchanged
- [ ] **Data consistency** - All three tables remain synchronized
- [ ] **Zero data loss** - Migration preserves all existing triples
- [ ] **Backward compatibility** - Ability to roll back to the legacy schema

### Operational Requirements

- [ ] **Safe migration** - Blue-green deployment with rollback capability
- [ ] **Monitoring coverage** - Comprehensive metrics for multi-table operations
- [ ] **Test coverage** - All query patterns tested with performance benchmarks
- [ ] **Documentation** - Updated deployment and operational procedures

## Timeline

### Phase 1: Implementation

- [ ] Rewrite `cassandra_kg.py` with the multi-table schema
- [ ] Implement batch write operations
- [ ] Add prepared-statement optimization
- [ ] Update unit tests

### Phase 2: Integration Testing

- [ ] Update integration tests
- [ ] Performance benchmarking
- [ ] Load testing with realistic data volumes
- [ ] Validation scripts for data consistency

### Phase 3: Migration Planning

- [ ] Blue-green deployment scripts
- [ ] Data migration tools
- [ ] Monitoring dashboard updates
- [ ] Rollback procedures

### Phase 4: Production Deployment

- [ ] Staged rollout to production
- [ ] Performance monitoring and validation
- [ ] Legacy table cleanup
- [ ] Documentation updates

## Conclusion

This multi-table denormalization strategy directly addresses the two critical performance bottlenecks:
1. **Eliminates expensive ALLOW FILTERING** by providing an optimal table structure for each query pattern
2. **Improves clustering effectiveness** through composite partition keys that distribute load properly

The approach leverages Cassandra's strengths while maintaining complete API compatibility, so existing code benefits automatically from the performance improvements.