mirror of https://github.com/trustgraph-ai/trustgraph.git, synced 2026-04-25 16:36:21 +02:00
# Tech Spec: Cassandra Knowledge Base Performance Refactor

**Status:** Draft
**Author:** Assistant
**Date:** 2025-09-18
## Overview

This specification addresses performance issues in the TrustGraph Cassandra knowledge base implementation and proposes optimizations for RDF triple storage and querying.
## Current Implementation

### Schema Design

The current implementation uses a single-table design in `trustgraph-flow/trustgraph/direct/cassandra_kg.py`:
```sql
CREATE TABLE triples (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY (collection, s, p, o)
);
```
**Secondary Indexes:**

- `triples_s` ON `s` (subject)
- `triples_p` ON `p` (predicate)
- `triples_o` ON `o` (object)
### Query Patterns

The current implementation supports 8 distinct query patterns:
1. **get_all(collection, limit=50)** - Retrieve all triples for a collection
   ```sql
   SELECT s, p, o FROM triples WHERE collection = ? LIMIT 50
   ```

2. **get_s(collection, s, limit=10)** - Query by subject
   ```sql
   SELECT p, o FROM triples WHERE collection = ? AND s = ? LIMIT 10
   ```

3. **get_p(collection, p, limit=10)** - Query by predicate
   ```sql
   SELECT s, o FROM triples WHERE collection = ? AND p = ? LIMIT 10
   ```

4. **get_o(collection, o, limit=10)** - Query by object
   ```sql
   SELECT s, p FROM triples WHERE collection = ? AND o = ? LIMIT 10
   ```

5. **get_sp(collection, s, p, limit=10)** - Query by subject + predicate
   ```sql
   SELECT o FROM triples WHERE collection = ? AND s = ? AND p = ? LIMIT 10
   ```

6. **get_po(collection, p, o, limit=10)** - Query by predicate + object ⚠️
   ```sql
   SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
   ```

7. **get_os(collection, o, s, limit=10)** - Query by object + subject ⚠️
   ```sql
   SELECT p FROM triples WHERE collection = ? AND o = ? AND s = ? LIMIT 10 ALLOW FILTERING
   ```

8. **get_spo(collection, s, p, o, limit=10)** - Exact triple match
   ```sql
   SELECT s AS x FROM triples WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT 10
   ```
### Current Architecture

**File: `trustgraph-flow/trustgraph/direct/cassandra_kg.py`**

- Single `KnowledgeGraph` class handling all operations
- Connection pooling through a global `_active_clusters` list
- Fixed table name: `"triples"`
- Keyspace-per-user model
- SimpleStrategy replication with factor 1
**Integration Points:**

- **Write Path:** `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- **Query Path:** `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`
- **Knowledge Store:** `trustgraph-flow/trustgraph/tables/knowledge.py`
## Performance Issues Identified

### Schema-Level Issues

1. **Inefficient Primary Key Design**
   - Current: `PRIMARY KEY (collection, s, p, o)`
   - Results in poor clustering for common access patterns
   - Forces expensive secondary index usage

2. **Secondary Index Overuse** ⚠️
   - Three secondary indexes on high-cardinality columns (s, p, o)
   - Secondary indexes in Cassandra are expensive and do not scale well
   - Queries 6 & 7 require `ALLOW FILTERING`, indicating poor data modeling

3. **Hot Partition Risk**
   - The single partition key `collection` can create hot partitions
   - Large collections concentrate on single nodes
   - No distribution strategy for load balancing
### Query-Level Issues

1. **ALLOW FILTERING Usage** ⚠️
   - Two query types (get_po, get_os) require `ALLOW FILTERING`
   - These queries scan multiple partitions and are extremely expensive
   - Performance degrades linearly with data size

2. **Inefficient Access Patterns**
   - No optimization for common RDF query patterns
   - Missing compound indexes for frequent query combinations
   - No consideration for graph traversal patterns

3. **Lack of Query Optimization**
   - No prepared-statement caching
   - No query hints or optimization strategies
   - No pagination support beyond a simple LIMIT
## Problem Statement

The current Cassandra knowledge base implementation has two critical performance bottlenecks:

### 1. Inefficient get_po Query Performance

The `get_po(collection, p, o)` query is extremely inefficient because it requires `ALLOW FILTERING`:

```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```

**Why this is problematic:**

- `ALLOW FILTERING` forces Cassandra to scan the entire collection partition row by row
- Performance degrades linearly with data size
- This is a common RDF query pattern (finding subjects with a specific predicate-object relationship)
- Creates significant load on the cluster as data grows
### 2. Poor Clustering Strategy

The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clustering benefits:

**Issues with current clustering:**

- `collection` as the sole partition key doesn't distribute data effectively
- Most collections contain diverse data, making clustering ineffective
- No consideration for common access patterns in RDF queries
- Large collections create hot partitions on single nodes
- The clustering columns (s, p, o) don't optimize for typical graph traversal patterns

**Impact:**

- Queries don't benefit from data locality
- Poor cache utilization
- Uneven load distribution across cluster nodes
- Scalability bottlenecks as collections grow
## Proposed Solution: Multi-Table Denormalization Strategy

### Overview

Replace the single `triples` table with three purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and `ALLOW FILTERING` while providing optimal performance for all query types.
### New Schema Design

**Table 1: Subject-Centric Queries**

```sql
CREATE TABLE triples_by_subject (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY ((collection, s), p, o)
);
```

- **Optimizes:** get_s, get_sp, get_spo
- **Partition Key:** (collection, s) - Better distribution than collection alone
- **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject
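To see why the composite partition key matters for distribution, here is a small standalone sketch. It uses Python's `hashlib` as a stand-in for Cassandra's Murmur3 partitioner, and the data is synthetic; the `token` helper is illustrative, not part of the codebase:

```python
import hashlib

def token(*key_parts: str) -> int:
    """Stand-in for the partitioner token: hash the partition key to an int."""
    h = hashlib.sha256("\x00".join(key_parts).encode()).digest()
    return int.from_bytes(h[:8], "big")

# 1000 synthetic triples, all in one collection
triples = [("docs", f"subj{i}", "rdf:type", "Entity") for i in range(1000)]

# Old schema: partition key = collection -> every triple hashes identically
old_partitions = {token(c) for (c, s, p, o) in triples}

# New schema: partition key = (collection, s) -> spread across many partitions
new_partitions = {token(c, s) for (c, s, p, o) in triples}

print(len(old_partitions))  # 1    (one hot partition holds everything)
print(len(new_partitions))  # 1000 (one partition per distinct subject)
```

Under the old schema the whole collection lands on the replicas owning a single token; under the new one the same data spreads across as many partitions as there are distinct subjects.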
**Table 2: Predicate-Object Queries**

```sql
CREATE TABLE triples_by_po (
    collection text,
    p text,
    o text,
    s text,
    PRIMARY KEY ((collection, p), o, s)
);
```

- **Optimizes:** get_p, get_po (eliminates ALLOW FILTERING!)
- **Partition Key:** (collection, p) - Direct access by predicate
- **Clustering:** (o, s) - Efficient object-subject traversal
**Table 3: Object-Centric Queries**

```sql
CREATE TABLE triples_by_object (
    collection text,
    o text,
    s text,
    p text,
    PRIMARY KEY ((collection, o), s, p)
);
```

- **Optimizes:** get_o, get_os
- **Partition Key:** (collection, o) - Direct access by object
- **Clustering:** (s, p) - Efficient subject-predicate traversal
### Query Mapping

| Original Query | Target Table | Performance Improvement |
|----------------|--------------|-------------------------|
| get_all(collection) | triples_by_subject | Token-based pagination |
| get_s(collection, s) | triples_by_subject | Direct partition access |
| get_p(collection, p) | triples_by_po | Direct partition access |
| get_o(collection, o) | triples_by_object | Direct partition access |
| get_sp(collection, s, p) | triples_by_subject | Partition + clustering |
| get_po(collection, p, o) | triples_by_po | **No more ALLOW FILTERING!** |
| get_os(collection, o, s) | triples_by_object | Partition + clustering |
| get_spo(collection, s, p, o) | triples_by_subject | Exact key lookup |

Note that get_os routes to `triples_by_object`: with partition key (collection, o), the known subject restricts the first clustering column directly.
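The mapping follows mechanically from which of s/p/o are bound. A standalone sketch of that routing rule (the `route_query` helper is illustrative, not from the codebase):

```python
def route_query(s=None, p=None, o=None) -> str:
    """Pick the target table from the pattern of bound triple components."""
    bound = (s is not None, p is not None, o is not None)
    routing = {
        (False, False, False): "triples_by_subject",  # get_all
        (True,  False, False): "triples_by_subject",  # get_s
        (False, True,  False): "triples_by_po",       # get_p
        (False, False, True):  "triples_by_object",   # get_o
        (True,  True,  False): "triples_by_subject",  # get_sp
        (False, True,  True):  "triples_by_po",       # get_po
        (True,  False, True):  "triples_by_object",   # get_os
        (True,  True,  True):  "triples_by_subject",  # get_spo
    }
    return routing[bound]

print(route_query(p="rdf:type", o="Entity"))  # triples_by_po
```

Keeping the routing in one table like this makes the table-per-pattern decision auditable and easy to unit-test.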
### Benefits

1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path
2. **No Secondary Indexes** - Each table IS the index for its query pattern
3. **Better Data Distribution** - Composite partition keys spread load effectively
4. **Predictable Performance** - Query time proportional to result size, not total data
5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture
## Implementation Plan

### Files Requiring Changes

#### Primary Implementation File

**`trustgraph-flow/trustgraph/direct/cassandra_kg.py`** - Complete rewrite required

**Current Methods to Refactor:**

```python
# Schema initialization
def init(self) -> None  # Replace single table with three tables

# Insert operations
def insert(self, collection, s, p, o) -> None  # Write to all three tables

# Query operations (API unchanged, implementation optimized)
def get_all(self, collection, limit=50)           # Use triples_by_subject
def get_s(self, collection, s, limit=10)          # Use triples_by_subject
def get_p(self, collection, p, limit=10)          # Use triples_by_po
def get_o(self, collection, o, limit=10)          # Use triples_by_object
def get_sp(self, collection, s, p, limit=10)      # Use triples_by_subject
def get_po(self, collection, p, o, limit=10)      # Use triples_by_po (NO ALLOW FILTERING!)
def get_os(self, collection, o, s, limit=10)      # Use triples_by_object
def get_spo(self, collection, s, p, o, limit=10)  # Use triples_by_subject

# Collection management
def delete_collection(self, collection) -> None  # Delete from all three tables
```
#### Integration Files (No Logic Changes Required)

**`trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`**

- No changes needed - uses the existing KnowledgeGraph API
- Benefits automatically from the performance improvements

**`trustgraph-flow/trustgraph/query/triples/cassandra/service.py`**

- No changes needed - uses the existing KnowledgeGraph API
- Benefits automatically from the performance improvements
### Test Files Requiring Updates

#### Unit Tests

**`tests/unit/test_storage/test_triples_cassandra_storage.py`**

- Update test expectations for schema changes
- Add tests for multi-table consistency
- Verify no ALLOW FILTERING in query plans

**`tests/unit/test_query/test_triples_cassandra_query.py`**

- Update performance assertions
- Test all 8 query patterns against the new tables
- Verify query routing to the correct tables

#### Integration Tests

**`tests/integration/test_cassandra_integration.py`**

- End-to-end testing with the new schema
- Performance benchmarking comparisons
- Data consistency verification across tables

**`tests/unit/test_storage/test_cassandra_config_integration.py`**

- Update schema validation tests
- Test migration scenarios
### Implementation Strategy

#### Phase 1: Schema and Core Methods

1. **Rewrite `init()` method** - Create three tables instead of one
2. **Rewrite `insert()` method** - Batch writes to all three tables
3. **Implement prepared statements** - For optimal performance
4. **Add table routing logic** - Direct queries to the optimal tables

#### Phase 2: Query Method Optimization

1. **Rewrite each get_* method** to use its optimal table
2. **Remove all ALLOW FILTERING** usage
3. **Implement efficient clustering key usage**
4. **Add query performance logging**

#### Phase 3: Collection Management

1. **Update `delete_collection()`** - Remove from all three tables
2. **Add consistency verification** - Ensure all tables stay in sync
3. **Implement batch operations** - For atomic multi-table operations
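The insert/delete fan-out behind Phases 1 and 3 follows one fixed rule: each logical triple maps to exactly one row per table, with columns reordered to match that table's primary key. A standalone sketch of the mapping (the `fan_out` helper name is illustrative):

```python
def fan_out(collection: str, s: str, p: str, o: str) -> dict:
    """Map one logical triple to its row in each denormalized table.
    Tuple order matches each table's PRIMARY KEY column order."""
    return {
        "triples_by_subject": (collection, s, p, o),
        "triples_by_po":      (collection, p, o, s),
        "triples_by_object":  (collection, o, s, p),
    }

rows = fan_out("docs", "s1", "p1", "o1")
print(rows["triples_by_po"])  # ('docs', 'p1', 'o1', 's1')
```

Centralizing the reordering in one place keeps insert, delete, and the consistency checks from drifting apart.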
### Key Implementation Details

#### Batch Write Strategy

```python
from cassandra.query import BatchStatement

def insert(self, collection, s, p, o):
    # One logical triple becomes one row in each of the three tables;
    # a logged batch keeps the multi-table write atomic
    batch = BatchStatement()
    batch.add(self.insert_subject_stmt, (collection, s, p, o))
    batch.add(self.insert_po_stmt, (collection, p, o, s))
    batch.add(self.insert_object_stmt, (collection, o, s, p))
    self.session.execute(batch)
```

The statements bound here are the ones cached by `prepare_statements()`; binding prepared statements inside the batch avoids re-parsing the CQL on every insert.
#### Query Routing Logic

```python
def get_po(self, collection, p, o, limit=10):
    # Served directly from triples_by_po - NO ALLOW FILTERING!
    # (%s is the Python driver's placeholder style for simple statements)
    return self.session.execute(
        "SELECT s FROM triples_by_po WHERE collection = %s AND p = %s AND o = %s LIMIT %s",
        (collection, p, o, limit)
    )
```
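One routing case deserves a note: the query-mapping table lists get_all as "token-based pagination". With (collection, s) as the partition key there is no single partition to read, so a full-collection scan walks the token ring in contiguous ranges, each range feeding one `WHERE token(collection, s) > ? AND token(collection, s) <= ?` query. Splitting the Murmur3 ring into ranges is plain arithmetic; a standalone sketch (range count is illustrative):

```python
# Murmur3 partitioner ring bounds used by Cassandra
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def token_ranges(n: int):
    """Split the full token ring into n contiguous (start, end] scan ranges."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * span for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds, bounds[1:]))

ranges = token_ranges(4)
print(len(ranges))  # 4
print(ranges[0][0] == MIN_TOKEN and ranges[-1][1] == MAX_TOKEN)  # True
```

In practice the driver's own result paging handles most of this, but explicit token ranges allow the scan to be parallelized across workers.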
#### Prepared Statement Optimization

```python
def prepare_statements(self):
    # Prepare once at startup; reuse on every insert/query
    self.insert_subject_stmt = self.session.prepare(
        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
    )
    self.insert_po_stmt = self.session.prepare(
        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
    )
    self.insert_object_stmt = self.session.prepare(
        "INSERT INTO triples_by_object (collection, o, s, p) VALUES (?, ?, ?, ?)"
    )
    # ... likewise for the SELECT statements of each query pattern
```
## Migration Strategy

### Data Migration Approach

#### Option 1: Blue-Green Deployment (Recommended)

1. **Deploy the new schema alongside the existing one** - Use different table names temporarily
2. **Dual-write period** - Write to both old and new schemas during the transition
3. **Background migration** - Copy existing data to the new tables
4. **Switch reads** - Route queries to the new tables once data is migrated
5. **Drop old tables** - After a verification period

#### Option 2: In-Place Migration

1. **Schema addition** - Create the new tables in the existing keyspace
2. **Data migration script** - Batch-copy from the old table to the new tables
3. **Application update** - Deploy the new code after migration completes
4. **Old table cleanup** - Remove the old table and its indexes
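Step 2 of the blue-green plan (the dual-write period) can live in a thin wrapper so the eventual cutover is a one-line change. A minimal standalone sketch with stub writers standing in for the legacy and optimized `KnowledgeGraph` instances (the `DualWriter` class is illustrative):

```python
class DualWriter:
    """Fan every insert out to both schemas during the migration window."""

    def __init__(self, legacy_writer, optimized_writer):
        self.writers = [legacy_writer, optimized_writer]

    def insert(self, collection, s, p, o):
        for write in self.writers:
            write(collection, s, p, o)

# Stub writers record what each schema would receive
legacy_rows, new_rows = [], []
dw = DualWriter(lambda *t: legacy_rows.append(t), lambda *t: new_rows.append(t))
dw.insert("docs", "s1", "p1", "o1")
print(legacy_rows == new_rows)  # True: both schemas received the triple
```

At cutover, the wrapper is replaced by the optimized writer alone; if rollback is needed, by the legacy writer alone.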
### Backward Compatibility

#### Deployment Strategy

```python
import os

# Environment variable to control table usage during migration
USE_LEGACY_TABLES = os.getenv('CASSANDRA_USE_LEGACY', 'false').lower() == 'true'

class KnowledgeGraph:
    def __init__(self, ...):
        if USE_LEGACY_TABLES:
            self.init_legacy_schema()
        else:
            self.init_optimized_schema()
```
#### Migration Script

```python
from itertools import batched  # Python 3.12+

def migrate_data():
    # Stream the old table; rows come back as (collection, s, p, o)
    old_triples = session.execute("SELECT collection, s, p, o FROM triples")

    # Batch-write to the new tables, 100 source rows at a time
    for batch in batched(old_triples, 100):
        batch_stmt = BatchStatement()
        for row in batch:
            # One row per new table, columns reordered to match each key
            batch_stmt.add(insert_subject_stmt, row)
            batch_stmt.add(insert_po_stmt, (row.collection, row.p, row.o, row.s))
            batch_stmt.add(insert_object_stmt, (row.collection, row.o, row.s, row.p))
        session.execute(batch_stmt)
```

Note that 100 source rows produce 300 batched statements; the batch size may need tuning against Cassandra's `batch_size_warn_threshold_in_kb`.
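The `batched()` helper used by the migration script is available as `itertools.batched` from Python 3.12; on older interpreters an equivalent is a few lines (a standalone sketch):

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive tuples of up to n items (itertools.batched equivalent)."""
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk

print(list(batched(range(7), 3)))  # [(0, 1, 2), (3, 4, 5), (6,)]
```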
### Validation Strategy

#### Data Consistency Checks

```python
def validate_migration():
    # Compare per-collection record counts in the old vs new tables.
    # The new table's partition key is (collection, s), so a collection-only
    # scan needs ALLOW FILTERING -- acceptable for a one-off migration check.
    old_count = session.execute(
        "SELECT COUNT(*) FROM triples WHERE collection = %s", (collection,)
    ).one()[0]
    new_count = session.execute(
        "SELECT COUNT(*) FROM triples_by_subject WHERE collection = %s ALLOW FILTERING",
        (collection,)
    ).one()[0]

    assert old_count == new_count, f"Record count mismatch: {old_count} vs {new_count}"

    # Spot-check random samples
    sample_queries = generate_test_queries()
    for query in sample_queries:
        old_result = execute_legacy_query(query)
        new_result = execute_optimized_query(query)
        assert old_result == new_result, f"Query results differ for {query}"
```
## Testing Strategy

### Performance Testing

#### Benchmark Scenarios

1. **Query Performance Comparison**
   - Before/after performance metrics for all 8 query types
   - Focus on the get_po improvement (ALLOW FILTERING eliminated)
   - Measure query latency at various data sizes

2. **Load Testing**
   - Concurrent query execution
   - Write throughput with batch operations
   - Memory and CPU utilization

3. **Scalability Testing**
   - Performance with increasing collection sizes
   - Multi-collection query distribution
   - Cluster node utilization

#### Test Data Sets

- **Small:** 10K triples per collection
- **Medium:** 100K triples per collection
- **Large:** 1M+ triples per collection
- **Multiple collections:** Test partition distribution
### Functional Testing

#### Unit Test Updates

```python
from unittest.mock import patch

# Example test structure for the new implementation
class TestCassandraKGPerformance:

    def test_get_po_no_allow_filtering(self):
        # Verify get_po queries don't use ALLOW FILTERING
        with patch('cassandra.cluster.Session.execute') as mock_execute:
            kg.get_po('test_collection', 'predicate', 'object')
            executed_query = str(mock_execute.call_args[0][0])
            assert 'ALLOW FILTERING' not in executed_query

    def test_multi_table_consistency(self):
        # Verify all tables stay in sync after a write
        kg.insert('test', 's1', 'p1', 'o1')

        # Check that every table contains the triple in its own column order
        assert_triple_exists('triples_by_subject', 'test', 's1', 'p1', 'o1')
        assert_triple_exists('triples_by_po', 'test', 'p1', 'o1', 's1')
        assert_triple_exists('triples_by_object', 'test', 'o1', 's1', 'p1')
```
#### Integration Test Updates

```python
class TestCassandraIntegration:

    def test_query_performance_regression(self):
        # Ensure the new implementation is faster than the old one
        old_time = benchmark_legacy_get_po()
        new_time = benchmark_optimized_get_po()
        assert new_time < old_time * 0.5  # At least a 50% improvement

    def test_end_to_end_workflow(self):
        # Test the complete write -> query -> delete cycle and
        # verify no performance degradation in integration
        ...
```
### Rollback Plan

#### Quick Rollback Strategy

1. **Environment variable toggle** - Switch back to the legacy tables immediately
2. **Keep legacy tables** - Don't drop them until performance is proven
3. **Monitoring alerts** - Automated rollback triggers based on error rates/latency

#### Rollback Validation

```python
import os

def rollback_to_legacy():
    # Flip the feature flag back to the legacy schema
    os.environ['CASSANDRA_USE_LEGACY'] = 'true'

    # Restart services to pick up the change
    restart_cassandra_services()

    # Validate functionality
    run_smoke_tests()
```
## Risks and Considerations

### Performance Risks

- **Write latency increase** - 3x write operations per insert
- **Storage overhead** - Roughly 3x the storage requirement
- **Batch write failures** - Need proper error handling

### Operational Risks

- **Migration complexity** - Data migration for large datasets
- **Consistency challenges** - Ensuring all tables stay synchronized
- **Monitoring gaps** - Need new metrics for multi-table operations

### Mitigation Strategies

1. **Gradual rollout** - Start with small collections
2. **Comprehensive monitoring** - Track all performance metrics
3. **Automated validation** - Continuous consistency checking
4. **Quick rollback capability** - Environment-based table selection
## Success Criteria

### Performance Improvements

- [ ] **Eliminate ALLOW FILTERING** - get_po and get_os queries run without filtering
- [ ] **Query latency reduction** - 50%+ improvement in query response times
- [ ] **Better load distribution** - No hot partitions; even load across cluster nodes
- [ ] **Scalable performance** - Query time proportional to result size, not total data

### Functional Requirements

- [ ] **API compatibility** - All existing code continues to work unchanged
- [ ] **Data consistency** - All three tables remain synchronized
- [ ] **Zero data loss** - Migration preserves all existing triples
- [ ] **Backward compatibility** - Ability to roll back to the legacy schema

### Operational Requirements

- [ ] **Safe migration** - Blue-green deployment with rollback capability
- [ ] **Monitoring coverage** - Comprehensive metrics for multi-table operations
- [ ] **Test coverage** - All query patterns tested with performance benchmarks
- [ ] **Documentation** - Updated deployment and operational procedures
## Timeline

### Phase 1: Implementation

- [ ] Rewrite `cassandra_kg.py` with the multi-table schema
- [ ] Implement batch write operations
- [ ] Add prepared statement optimization
- [ ] Update unit tests

### Phase 2: Integration Testing

- [ ] Update integration tests
- [ ] Performance benchmarking
- [ ] Load testing with realistic data volumes
- [ ] Validation scripts for data consistency

### Phase 3: Migration Planning

- [ ] Blue-green deployment scripts
- [ ] Data migration tools
- [ ] Monitoring dashboard updates
- [ ] Rollback procedures

### Phase 4: Production Deployment

- [ ] Staged rollout to production
- [ ] Performance monitoring and validation
- [ ] Legacy table cleanup
- [ ] Documentation updates
## Conclusion

This multi-table denormalization strategy directly addresses the two critical performance bottlenecks:

1. **Eliminates expensive ALLOW FILTERING** by providing an optimal table structure for each query pattern
2. **Improves clustering effectiveness** through composite partition keys that distribute load properly

The approach leverages Cassandra's strengths while maintaining complete API compatibility, so existing code benefits automatically from the performance improvements.