# Tech Spec: Cassandra Knowledge Base Performance Refactor
**Status:** Draft
**Author:** Assistant
**Date:** 2025-09-18
## Overview
This specification addresses performance issues in the TrustGraph Cassandra knowledge base implementation and proposes optimizations for RDF triple storage and querying.
## Current Implementation
### Schema Design
The current implementation uses a single table design in `trustgraph-flow/trustgraph/direct/cassandra_kg.py`:
```sql
CREATE TABLE triples (
collection text,
s text,
p text,
o text,
PRIMARY KEY (collection, s, p, o)
);
```
**Secondary Indexes:**
- `triples_s` ON `s` (subject)
- `triples_p` ON `p` (predicate)
- `triples_o` ON `o` (object)
### Query Patterns
The current implementation supports 8 distinct query patterns:
1. **get_all(collection, limit=50)** - Retrieve all triples for a collection
```sql
SELECT s, p, o FROM triples WHERE collection = ? LIMIT 50
```
2. **get_s(collection, s, limit=10)** - Query by subject
```sql
SELECT p, o FROM triples WHERE collection = ? AND s = ? LIMIT 10
```
3. **get_p(collection, p, limit=10)** - Query by predicate
```sql
SELECT s, o FROM triples WHERE collection = ? AND p = ? LIMIT 10
```
4. **get_o(collection, o, limit=10)** - Query by object
```sql
SELECT s, p FROM triples WHERE collection = ? AND o = ? LIMIT 10
```
5. **get_sp(collection, s, p, limit=10)** - Query by subject + predicate
```sql
SELECT o FROM triples WHERE collection = ? AND s = ? AND p = ? LIMIT 10
```
6. **get_po(collection, p, o, limit=10)** - Query by predicate + object ⚠️
```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```
7. **get_os(collection, o, s, limit=10)** - Query by object + subject ⚠️
```sql
SELECT p FROM triples WHERE collection = ? AND o = ? AND s = ? LIMIT 10 ALLOW FILTERING
```
8. **get_spo(collection, s, p, o, limit=10)** - Exact triple match
```sql
SELECT s as x FROM triples WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT 10
```
### Current Architecture
**File: `trustgraph-flow/trustgraph/direct/cassandra_kg.py`**
- Single `KnowledgeGraph` class handling all operations
- Connection pooling through global `_active_clusters` list
- Fixed table name: `"triples"`
- Keyspace per user model
- SimpleStrategy replication with factor 1
**Integration Points:**
- **Write Path:** `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- **Query Path:** `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`
- **Knowledge Store:** `trustgraph-flow/trustgraph/tables/knowledge.py`
## Performance Issues Identified
### Schema-Level Issues
1. **Inefficient Primary Key Design**
- Current: `PRIMARY KEY (collection, s, p, o)`
- Results in poor clustering for common access patterns
- Forces expensive secondary index usage
2. **Secondary Index Overuse** ⚠️
- Three secondary indexes on high-cardinality columns (s, p, o)
- Secondary indexes in Cassandra are expensive and don't scale well
- Queries 6 & 7 require `ALLOW FILTERING` indicating poor data modeling
3. **Hot Partition Risk**
- Single partition key `collection` can create hot partitions
- Large collections will concentrate on single nodes
- No distribution strategy for load balancing
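To make the hot-partition point concrete, here is a small illustration. It uses `hashlib` as a stand-in for Cassandra's Murmur3 partitioner (an assumption for demonstration only): with `collection` alone as the partition key, every row in a collection hashes to the same token and lands on the same replica set, whereas a composite key spreads rows across the ring.

```python
import hashlib

def token(key: str) -> int:
    """Stand-in for Cassandra's Murmur3 partitioner: map a partition key to a token."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

triples = [("docs", f"subject-{i}") for i in range(1000)]

# Current schema: partition key is `collection` alone, so every row in the
# collection maps to one token -> one hot partition.
single_key_tokens = {token(c) for c, s in triples}

# Composite partition key (collection, s): tokens spread across the ring.
composite_tokens = {token(f"{c}:{s}") for c, s in triples}

print(len(single_key_tokens))   # 1
print(len(composite_tokens))    # 1000
```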
### Query-Level Issues
1. **ALLOW FILTERING Usage** ⚠️
- Two query types (get_po, get_os) require `ALLOW FILTERING`
- These queries must scan and filter every row in the collection's partition, which is extremely expensive
- Performance degrades linearly with data size
2. **Inefficient Access Patterns**
- No optimization for common RDF query patterns
- Missing compound indexes for frequent query combinations
- No consideration for graph traversal patterns
3. **Lack of Query Optimization**
- No prepared statements caching
- No query hints or optimization strategies
- No consideration for pagination beyond simple LIMIT
## Problem Statement
The current Cassandra knowledge base implementation has two critical performance bottlenecks:
### 1. Inefficient get_po Query Performance
The `get_po(collection, p, o)` query is extremely inefficient due to requiring `ALLOW FILTERING`:
```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```
**Why this is problematic:**
- `ALLOW FILTERING` forces Cassandra to read the entire collection partition and filter rows server-side
- Performance degrades linearly with data size
- This is a common RDF query pattern (finding subjects that have a specific predicate-object relationship)
- Creates significant load on the cluster as data grows
### 2. Poor Clustering Strategy
The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clustering benefits:
**Issues with current clustering:**
- `collection` as partition key doesn't distribute data effectively
- Most collections contain diverse data, making clustering ineffective
- No consideration for common access patterns in RDF queries
- Large collections create hot partitions on single nodes
- Clustering columns (s, p, o) don't optimize for typical graph traversal patterns
**Impact:**
- Queries don't benefit from data locality
- Poor cache utilization
- Uneven load distribution across cluster nodes
- Scalability bottlenecks as collections grow
## Proposed Solution: Multi-Table Denormalization Strategy
### Overview
Replace the single `triples` table with three purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and ALLOW FILTERING while providing optimal performance for all query types.
### New Schema Design
**Table 1: Subject-Centric Queries**
```sql
CREATE TABLE triples_by_subject (
collection text,
s text,
p text,
o text,
PRIMARY KEY ((collection, s), p, o)
);
```
- **Optimizes:** get_all, get_s, get_sp, get_spo
- **Partition Key:** (collection, s) - Better distribution than collection alone
- **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject
**Table 2: Predicate-Object Queries**
```sql
CREATE TABLE triples_by_po (
collection text,
p text,
o text,
s text,
PRIMARY KEY ((collection, p), o, s)
);
```
- **Optimizes:** get_p, get_po (eliminates ALLOW FILTERING!)
- **Partition Key:** (collection, p) - Direct access by predicate
- **Clustering:** (o, s) - Efficient object-subject traversal
**Table 3: Object-Centric Queries**
```sql
CREATE TABLE triples_by_object (
collection text,
o text,
s text,
p text,
PRIMARY KEY ((collection, o), s, p)
);
```
- **Optimizes:** get_o, get_os
- **Partition Key:** (collection, o) - Direct access by object
- **Clustering:** (s, p) - Efficient subject-predicate traversal
### Query Mapping
| Original Query | Target Table | Performance Improvement |
|----------------|-------------|------------------------|
| get_all(collection) | triples_by_subject | Token-based pagination |
| get_s(collection, s) | triples_by_subject | Direct partition access |
| get_p(collection, p) | triples_by_po | Direct partition access |
| get_o(collection, o) | triples_by_object | Direct partition access |
| get_sp(collection, s, p) | triples_by_subject | Partition + clustering |
| get_po(collection, p, o) | triples_by_po | **No more ALLOW FILTERING!** |
| get_os(collection, o, s) | triples_by_object | Partition + clustering |
| get_spo(collection, s, p, o) | triples_by_subject | Exact key lookup |
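The routing in the table above can be captured directly in code. The following is an illustrative sketch (the `QUERY_TABLE` constant and `table_for` helper are not part of the existing API); each key names which triple positions are bound in the query:

```python
# Illustrative routing map mirroring the query-mapping table. get_os routes
# to triples_by_object because its partition key (collection, o) plus leading
# clustering column s serves the (object, subject) lookup without filtering.
QUERY_TABLE = {
    "all": "triples_by_subject",
    "s":   "triples_by_subject",
    "p":   "triples_by_po",
    "o":   "triples_by_object",
    "sp":  "triples_by_subject",
    "po":  "triples_by_po",
    "os":  "triples_by_object",
    "spo": "triples_by_subject",
}

def table_for(bound: str) -> str:
    """Return the table that serves a query pattern, e.g. table_for("po")."""
    return QUERY_TABLE[bound]
```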
### Benefits
1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path
2. **No Secondary Indexes** - Each table IS the index for its query pattern
3. **Better Data Distribution** - Composite partition keys spread load effectively
4. **Predictable Performance** - Query time proportional to result size, not total data
5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture
## Implementation Plan
### Files Requiring Changes
#### Primary Implementation File
**`trustgraph-flow/trustgraph/direct/cassandra_kg.py`** - Complete rewrite required
**Current Methods to Refactor:**
```python
# Schema initialization
def init(self) -> None # Replace single table with three tables
# Insert operations
def insert(self, collection, s, p, o) -> None # Write to all three tables
# Query operations (API unchanged, implementation optimized)
def get_all(self, collection, limit=50) # Use triples_by_subject
def get_s(self, collection, s, limit=10) # Use triples_by_subject
def get_p(self, collection, p, limit=10) # Use triples_by_po
def get_o(self, collection, o, limit=10) # Use triples_by_object
def get_sp(self, collection, s, p, limit=10) # Use triples_by_subject
def get_po(self, collection, p, o, limit=10) # Use triples_by_po (NO ALLOW FILTERING!)
def get_os(self, collection, o, s, limit=10) # Use triples_by_object
def get_spo(self, collection, s, p, o, limit=10) # Use triples_by_subject
# Collection management
def delete_collection(self, collection) -> None # Delete from all three tables
```
#### Integration Files (No Logic Changes Required)
**`trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`**
- No changes needed - uses existing KnowledgeGraph API
- Benefits automatically from performance improvements
**`trustgraph-flow/trustgraph/query/triples/cassandra/service.py`**
- No changes needed - uses existing KnowledgeGraph API
- Benefits automatically from performance improvements
### Test Files Requiring Updates
#### Unit Tests
**`tests/unit/test_storage/test_triples_cassandra_storage.py`**
- Update test expectations for schema changes
- Add tests for multi-table consistency
- Verify no ALLOW FILTERING in query plans
**`tests/unit/test_query/test_triples_cassandra_query.py`**
- Update performance assertions
- Test all 8 query patterns against new tables
- Verify query routing to correct tables
#### Integration Tests
**`tests/integration/test_cassandra_integration.py`**
- End-to-end testing with new schema
- Performance benchmarking comparisons
- Data consistency verification across tables
**`tests/unit/test_storage/test_cassandra_config_integration.py`**
- Update schema validation tests
- Test migration scenarios
### Implementation Strategy
#### Phase 1: Schema and Core Methods
1. **Rewrite `init()` method** - Create three tables instead of one
2. **Rewrite `insert()` method** - Batch writes to all three tables
3. **Implement prepared statements** - For optimal performance
4. **Add table routing logic** - Direct queries to optimal tables
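A minimal sketch of what the rewritten `init()` could look like, assuming a cassandra-driver session on `self.session` (the `TABLES` constant is illustrative); the DDL mirrors the three tables defined in the schema section:

```python
# DDL for the three denormalized tables, keyed by table name.
TABLES = {
    "triples_by_subject": "PRIMARY KEY ((collection, s), p, o)",
    "triples_by_po":      "PRIMARY KEY ((collection, p), o, s)",
    "triples_by_object":  "PRIMARY KEY ((collection, o), s, p)",
}

def init(self) -> None:
    # Create each table if absent; all three share the same columns and
    # differ only in primary key layout.
    for name, key in TABLES.items():
        self.session.execute(
            f"CREATE TABLE IF NOT EXISTS {name} "
            f"(collection text, s text, p text, o text, {key})"
        )
    self.prepare_statements()
```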
#### Phase 2: Query Method Optimization
1. **Rewrite each get_* method** to use optimal table
2. **Remove all ALLOW FILTERING** usage
3. **Implement efficient clustering key usage**
4. **Add query performance logging**
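The query-performance-logging step could be as simple as a decorator applied to each `get_*` method; a sketch under that assumption (the `timed_query` name and logger are illustrative):

```python
import functools
import logging
import time

log = logging.getLogger("cassandra_kg")

def timed_query(fn):
    """Log each query method's latency at debug level."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.debug("%s took %.2f ms", fn.__name__, elapsed_ms)
    return wrapper

@timed_query
def get_po(collection, p, o, limit=10):
    ...  # routed to triples_by_po as described above
```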
#### Phase 3: Collection Management
1. **Update `delete_collection()`** - Remove from all three tables
2. **Add consistency verification** - Ensure all tables stay in sync
3. **Implement batch operations** - For atomic multi-table operations
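The consistency-verification step reduces to a set comparison once each table's rows are normalized back to `(collection, s, p, o)` order. A self-contained sketch of the idea (in production the row sets would come from table scans rather than in-memory sets):

```python
def check_tables_in_sync(by_subject, by_po, by_object):
    """Report triples missing from any of the three tables.

    Each argument is the table's rows in that table's own column order:
    by_subject as (c, s, p, o), by_po as (c, p, o, s), by_object as (c, o, s, p).
    """
    subj = set(by_subject)                              # already (c, s, p, o)
    po   = {(c, s, p, o) for (c, p, o, s) in by_po}
    obj  = {(c, s, p, o) for (c, o, s, p) in by_object}
    # Any triple not present in all three views indicates drift.
    return (subj | po | obj) - (subj & po & obj)

rows_subject = {("docs", "s1", "p1", "o1")}
rows_po      = {("docs", "p1", "o1", "s1")}
rows_object  = {("docs", "o1", "s1", "p1")}
print(check_tables_in_sync(rows_subject, rows_po, rows_object))  # set() -> in sync
```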
### Key Implementation Details
#### Batch Write Strategy
```python
from cassandra.query import BatchStatement

def insert(self, collection, s, p, o):
    # Write the triple to all three tables in one logged batch so the
    # denormalized views stay consistent; statements are prepared once
    # in prepare_statements() below
    batch = BatchStatement()
    batch.add(self.insert_subject_stmt, (collection, s, p, o))
    batch.add(self.insert_po_stmt, (collection, p, o, s))
    batch.add(self.insert_object_stmt, (collection, o, s, p))
    self.session.execute(batch)
```
#### Query Routing Logic
```python
def get_po(self, collection, p, o, limit=10):
    # Route to triples_by_po table - NO ALLOW FILTERING!
    # (the Python driver uses %s placeholders for simple statements)
    return self.session.execute(
        "SELECT s FROM triples_by_po WHERE collection = %s AND p = %s AND o = %s LIMIT %s",
        (collection, p, o, limit)
    )
```
#### Prepared Statement Optimization
```python
def prepare_statements(self):
    # Cache prepared statements for better performance
    self.insert_subject_stmt = self.session.prepare(
        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
    )
    self.insert_po_stmt = self.session.prepare(
        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
    )
    # ... etc for all tables and queries
```
## Migration Strategy
### Data Migration Approach
#### Option 1: Blue-Green Deployment (Recommended)
1. **Deploy new schema alongside existing** - Use different table names temporarily
2. **Dual-write period** - Write to both old and new schemas during transition
3. **Background migration** - Copy existing data to new tables
4. **Switch reads** - Route queries to new tables once data is migrated
5. **Drop old tables** - After verification period
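The dual-write period in step 2 could be implemented with a thin shim that fans writes out to both schemas while reads stay on the legacy path until the migration is verified. A sketch, assuming both objects expose the existing `KnowledgeGraph` API (the class and attribute names are illustrative):

```python
class DualWriteKnowledgeGraph:
    """Transition-window shim: every insert goes to both the legacy
    single-table schema and the new three-table schema."""

    def __init__(self, legacy, optimized):
        self.legacy = legacy
        self.optimized = optimized
        self.read_from_new = False  # flipped once migration is verified

    def insert(self, collection, s, p, o):
        # Fan out writes so both schemas stay current during migration.
        self.legacy.insert(collection, s, p, o)
        self.optimized.insert(collection, s, p, o)

    def get_po(self, collection, p, o, limit=10):
        # Reads stay on the legacy path until the switch is flipped.
        target = self.optimized if self.read_from_new else self.legacy
        return target.get_po(collection, p, o, limit=limit)
```

The same pattern would extend to the other `get_*` methods; flipping `read_from_new` implements step 4 ("switch reads") without a redeploy.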
#### Option 2: In-Place Migration
1. **Schema addition** - Create new tables in existing keyspace
2. **Data migration script** - Batch copy from old table to new tables
3. **Application update** - Deploy new code after migration completes
4. **Old table cleanup** - Remove old table and indexes
### Backward Compatibility
#### Deployment Strategy
```python
import os

# Environment variable to control table usage during migration
USE_LEGACY_TABLES = os.getenv('CASSANDRA_USE_LEGACY', 'false').lower() == 'true'

class KnowledgeGraph:
    def __init__(self, ...):
        if USE_LEGACY_TABLES:
            self.init_legacy_schema()
        else:
            self.init_optimized_schema()
```
#### Migration Script
```python
from itertools import batched  # Python 3.12+; use a small chunking helper on older versions

def migrate_data():
    # Read from old table
    old_triples = session.execute("SELECT collection, s, p, o FROM triples")
    # Batch write to new tables
    for chunk in batched(old_triples, 100):
        batch_stmt = BatchStatement()
        for row in chunk:
            # Add to all three new tables, in each table's column order
            batch_stmt.add(insert_subject_stmt, (row.collection, row.s, row.p, row.o))
            batch_stmt.add(insert_po_stmt, (row.collection, row.p, row.o, row.s))
            batch_stmt.add(insert_object_stmt, (row.collection, row.o, row.s, row.p))
        session.execute(batch_stmt)
```
### Validation Strategy
#### Data Consistency Checks
```python
def validate_migration(collection):
    # Count total records in old vs new tables
    old_count = session.execute(
        "SELECT COUNT(*) FROM triples WHERE collection = %s", (collection,)
    ).one()[0]
    # triples_by_subject partitions on (collection, s), so counting by
    # collection alone spans partitions and needs ALLOW FILTERING - fine
    # for a one-off validation pass, never for production queries
    new_count = session.execute(
        "SELECT COUNT(*) FROM triples_by_subject WHERE collection = %s ALLOW FILTERING",
        (collection,)
    ).one()[0]
    assert old_count == new_count, f"Record count mismatch: {old_count} vs {new_count}"

    # Spot check random samples
    sample_queries = generate_test_queries()
    for query in sample_queries:
        old_result = execute_legacy_query(query)
        new_result = execute_optimized_query(query)
        assert old_result == new_result, f"Query results differ for {query}"
```
## Testing Strategy
### Performance Testing
#### Benchmark Scenarios
1. **Query Performance Comparison**
- Before/after performance metrics for all 8 query types
- Focus on get_po performance improvement (eliminate ALLOW FILTERING)
- Measure query latency under various data sizes
2. **Load Testing**
- Concurrent query execution
- Write throughput with batch operations
- Memory and CPU utilization
3. **Scalability Testing**
- Performance with increasing collection sizes
- Multi-collection query distribution
- Cluster node utilization
#### Test Data Sets
- **Small:** 10K triples per collection
- **Medium:** 100K triples per collection
- **Large:** 1M+ triples per collection
- **Multiple collections:** Test partition distribution
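A generator for these data sets might look like the following sketch; the Zipf-like predicate skew is an assumption intended to mimic real RDF workloads, where a few predicates (`rdf:type` and the like) dominate:

```python
import random

def generate_triples(n, collection="bench", n_predicates=50):
    """Yield n synthetic (collection, s, p, o) triples for benchmarking."""
    predicates = [f"p{i}" for i in range(n_predicates)]
    weights = [1 / (i + 1) for i in range(n_predicates)]   # Zipf-like skew
    for _ in range(n):
        # Reuse subjects so each subject carries multiple triples,
        # exercising the (collection, s) partitions realistically.
        s = f"subject-{random.randrange(n // 10 + 1)}"
        p = random.choices(predicates, weights=weights, k=1)[0]
        o = f"object-{random.randrange(n)}"
        yield (collection, s, p, o)
```

Scaling `n` to 10K, 100K, and 1M produces the three benchmark tiers above.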
### Functional Testing
#### Unit Test Updates
```python
# Example test structure for new implementation
class TestCassandraKGPerformance:
    def test_get_po_no_allow_filtering(self):
        # Verify get_po queries don't use ALLOW FILTERING
        with patch('cassandra.cluster.Session.execute') as mock_execute:
            kg.get_po('test_collection', 'predicate', 'object')
            executed_query = mock_execute.call_args[0][0]
            assert 'ALLOW FILTERING' not in executed_query

    def test_multi_table_consistency(self):
        # Verify all tables stay in sync
        kg.insert('test', 's1', 'p1', 'o1')
        # Check all tables contain the triple
        assert_triple_exists('triples_by_subject', 'test', 's1', 'p1', 'o1')
        assert_triple_exists('triples_by_po', 'test', 'p1', 'o1', 's1')
        assert_triple_exists('triples_by_object', 'test', 'o1', 's1', 'p1')
```
#### Integration Test Updates
```python
class TestCassandraIntegration:
    def test_query_performance_regression(self):
        # Ensure new implementation is faster than old
        old_time = benchmark_legacy_get_po()
        new_time = benchmark_optimized_get_po()
        assert new_time < old_time * 0.5  # At least 50% improvement

    def test_end_to_end_workflow(self):
        # Test complete write -> query -> delete cycle
        # Verify no performance degradation in integration
        ...
```
### Rollback Plan
#### Quick Rollback Strategy
1. **Environment variable toggle** - Switch back to legacy tables immediately
2. **Keep legacy tables** - Don't drop until performance is proven
3. **Monitoring alerts** - Automated rollback triggers based on error rates/latency
#### Rollback Validation
```python
def rollback_to_legacy():
    # Set environment variable
    os.environ['CASSANDRA_USE_LEGACY'] = 'true'
    # Restart services to pick up change
    restart_cassandra_services()
    # Validate functionality
    run_smoke_tests()
```
## Risks and Considerations
### Performance Risks
- **Write latency increase** - 3x write operations per insert
- **Storage overhead** - 3x storage requirement
- **Batch write failures** - Need proper error handling
### Operational Risks
- **Migration complexity** - Data migration for large datasets
- **Consistency challenges** - Ensuring all tables stay synchronized
- **Monitoring gaps** - Need new metrics for multi-table operations
### Mitigation Strategies
1. **Gradual rollout** - Start with small collections
2. **Comprehensive monitoring** - Track all performance metrics
3. **Automated validation** - Continuous consistency checking
4. **Quick rollback capability** - Environment-based table selection
## Success Criteria
### Performance Improvements
- [ ] **Eliminate ALLOW FILTERING** - get_po and get_os queries run without filtering
- [ ] **Query latency reduction** - 50%+ improvement in query response times
- [ ] **Better load distribution** - No hot partitions, even load across cluster nodes
- [ ] **Scalable performance** - Query time proportional to result size, not total data
### Functional Requirements
- [ ] **API compatibility** - All existing code continues to work unchanged
- [ ] **Data consistency** - All three tables remain synchronized
- [ ] **Zero data loss** - Migration preserves all existing triples
- [ ] **Backward compatibility** - Ability to rollback to legacy schema
### Operational Requirements
- [ ] **Safe migration** - Blue-green deployment with rollback capability
- [ ] **Monitoring coverage** - Comprehensive metrics for multi-table operations
- [ ] **Test coverage** - All query patterns tested with performance benchmarks
- [ ] **Documentation** - Updated deployment and operational procedures
## Timeline
### Phase 1: Implementation
- [ ] Rewrite `cassandra_kg.py` with multi-table schema
- [ ] Implement batch write operations
- [ ] Add prepared statement optimization
- [ ] Update unit tests
### Phase 2: Integration Testing
- [ ] Update integration tests
- [ ] Performance benchmarking
- [ ] Load testing with realistic data volumes
- [ ] Validation scripts for data consistency
### Phase 3: Migration Planning
- [ ] Blue-green deployment scripts
- [ ] Data migration tools
- [ ] Monitoring dashboard updates
- [ ] Rollback procedures
### Phase 4: Production Deployment
- [ ] Staged rollout to production
- [ ] Performance monitoring and validation
- [ ] Legacy table cleanup
- [ ] Documentation updates
## Conclusion
This multi-table denormalization strategy directly addresses the two critical performance bottlenecks:
1. **Eliminates expensive ALLOW FILTERING** by providing optimal table structures for each query pattern
2. **Improves clustering effectiveness** through composite partition keys that distribute load properly
The approach leverages Cassandra's strengths while maintaining complete API compatibility, ensuring existing code benefits automatically from the performance improvements.