Release 1.4 -> master (#524)

Catch up
This commit is contained in:
cybermaggedon 2025-09-20 16:00:37 +01:00 committed by GitHub
parent a8e437fc7f
commit 6c7af8789d
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
216 changed files with 31360 additions and 1611 deletions


@@ -0,0 +1,331 @@
# Tech Spec: Cassandra Configuration Consolidation
**Status:** Draft
**Author:** Assistant
**Date:** 2024-09-03
## Overview
This specification addresses the inconsistent naming and configuration patterns for Cassandra connection parameters across the TrustGraph codebase. Currently, two different parameter naming schemes exist (`cassandra_*` vs `graph_*`), leading to confusion and maintenance complexity.
## Problem Statement
The codebase currently uses two distinct sets of Cassandra configuration parameters:
1. **Knowledge/Config/Library modules** use:
- `cassandra_host` (list of hosts)
- `cassandra_user`
- `cassandra_password`
2. **Graph/Storage modules** use:
- `graph_host` (single host, sometimes converted to list)
- `graph_username`
- `graph_password`
3. **Inconsistent command-line exposure**:
- Some processors (e.g., `kg-store`) don't expose Cassandra settings as command-line arguments
- Other processors expose them with different names and formats
- Help text doesn't reflect environment variable defaults
Both parameter sets connect to the same Cassandra cluster but with different naming conventions, causing:
- Configuration confusion for users
- Increased maintenance burden
- Inconsistent documentation
- Potential for misconfiguration
- Inability to override settings via command-line in some processors
## Proposed Solution
### 1. Standardize Parameter Names
All modules will use consistent `cassandra_*` parameter names:
- `cassandra_host` - List of hosts (internally stored as list)
- `cassandra_username` - Username for authentication
- `cassandra_password` - Password for authentication
### 2. Command-Line Arguments
All processors MUST expose Cassandra configuration via command-line arguments:
- `--cassandra-host` - Comma-separated list of hosts
- `--cassandra-username` - Username for authentication
- `--cassandra-password` - Password for authentication
### 3. Environment Variable Fallback
If command-line parameters are not explicitly provided, the system will check environment variables:
- `CASSANDRA_HOST` - Comma-separated list of hosts
- `CASSANDRA_USERNAME` - Username for authentication
- `CASSANDRA_PASSWORD` - Password for authentication
### 4. Default Values
If neither command-line parameters nor environment variables are specified:
- `cassandra_host` defaults to `["cassandra"]`
- `cassandra_username` defaults to `None` (no authentication)
- `cassandra_password` defaults to `None` (no authentication)
### 5. Help Text Requirements
The `--help` output must:
- Show environment variable values as defaults when set
- Never display password values (show `****` or `<set>` instead)
- Clearly indicate the resolution order in help text
Example help output:
```
--cassandra-host HOST
Cassandra host list, comma-separated (default: prod-cluster-1,prod-cluster-2)
[from CASSANDRA_HOST environment variable]
--cassandra-username USERNAME
Cassandra username (default: cassandra_user)
[from CASSANDRA_USERNAME environment variable]
--cassandra-password PASSWORD
Cassandra password (default: <set from environment>)
```
## Implementation Details
### Parameter Resolution Order
For each Cassandra parameter, the resolution order will be:
1. Command-line argument value
2. Environment variable (`CASSANDRA_*`)
3. Default value
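In other words, resolution for each parameter reduces to the following (a minimal sketch; the Phase 1 helper below achieves the same effect through argparse defaults):
```python
import os

def resolve(cli_value, env_name, default):
    # Resolution order: 1. command-line argument, 2. environment variable, 3. default
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name, default)
```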
### Host Parameter Handling
The `cassandra_host` parameter:
- Command-line accepts comma-separated string: `--cassandra-host "host1,host2,host3"`
- Environment variable accepts comma-separated string: `CASSANDRA_HOST="host1,host2,host3"`
- Internally always stored as list: `["host1", "host2", "host3"]`
- Single host: `"localhost"` → converted to `["localhost"]`
- Already a list: `["host1", "host2"]` → used as-is
### Authentication Logic
Authentication will be used when both `cassandra_username` and `cassandra_password` are provided:
```python
if cassandra_username and cassandra_password:
    # Use SSL context and PlainTextAuthProvider (ssl_context prepared elsewhere)
    auth_provider = PlainTextAuthProvider(cassandra_username, cassandra_password)
    cluster = Cluster(cassandra_host, auth_provider=auth_provider, ssl_context=ssl_context)
else:
    # Connect without authentication
    cluster = Cluster(cassandra_host)
```
## Files to Modify
### Modules using `graph_*` parameters (to be changed):
- `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/rows/cassandra/write.py`
- `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`
### Modules using `cassandra_*` parameters (to be updated with env fallback):
- `trustgraph-flow/trustgraph/tables/config.py`
- `trustgraph-flow/trustgraph/tables/knowledge.py`
- `trustgraph-flow/trustgraph/tables/library.py`
- `trustgraph-flow/trustgraph/storage/knowledge/store.py`
- `trustgraph-flow/trustgraph/cores/knowledge.py`
- `trustgraph-flow/trustgraph/librarian/librarian.py`
- `trustgraph-flow/trustgraph/librarian/service.py`
- `trustgraph-flow/trustgraph/config/service/service.py`
- `trustgraph-flow/trustgraph/cores/service.py`
### Test Files to Update:
- `tests/unit/test_cores/test_knowledge_manager.py`
- `tests/unit/test_storage/test_triples_cassandra_storage.py`
- `tests/unit/test_query/test_triples_cassandra_query.py`
- `tests/integration/test_objects_cassandra_integration.py`
## Implementation Strategy
### Phase 1: Create Common Configuration Helper
Create utility functions to standardize Cassandra configuration across all processors:
```python
import os
import argparse
def get_cassandra_defaults():
"""Get default values from environment variables or fallback."""
return {
'host': os.getenv('CASSANDRA_HOST', 'cassandra'),
'username': os.getenv('CASSANDRA_USERNAME'),
'password': os.getenv('CASSANDRA_PASSWORD')
}
def add_cassandra_args(parser: argparse.ArgumentParser):
"""
Add standardized Cassandra arguments to an argument parser.
Shows environment variable values in help text.
"""
defaults = get_cassandra_defaults()
# Format help text with env var indication
host_help = f"Cassandra host list, comma-separated (default: {defaults['host']})"
if 'CASSANDRA_HOST' in os.environ:
host_help += " [from CASSANDRA_HOST]"
username_help = f"Cassandra username"
if defaults['username']:
username_help += f" (default: {defaults['username']})"
if 'CASSANDRA_USERNAME' in os.environ:
username_help += " [from CASSANDRA_USERNAME]"
password_help = "Cassandra password"
if defaults['password']:
password_help += " (default: <set>)"
if 'CASSANDRA_PASSWORD' in os.environ:
password_help += " [from CASSANDRA_PASSWORD]"
parser.add_argument(
'--cassandra-host',
default=defaults['host'],
help=host_help
)
parser.add_argument(
'--cassandra-username',
default=defaults['username'],
help=username_help
)
parser.add_argument(
'--cassandra-password',
default=defaults['password'],
help=password_help
)
def resolve_cassandra_config(args) -> tuple[list[str], str|None, str|None]:
"""
Convert argparse args to Cassandra configuration.
Returns:
tuple: (hosts_list, username, password)
"""
# Convert host string to list
if isinstance(args.cassandra_host, str):
hosts = [h.strip() for h in args.cassandra_host.split(',')]
else:
hosts = args.cassandra_host
return hosts, args.cassandra_username, args.cassandra_password
```
### Phase 2: Update Modules Using `graph_*` Parameters
1. Change parameter names from `graph_*` to `cassandra_*`
2. Replace custom `add_args()` methods with standardized `add_cassandra_args()`
3. Use the common configuration helper functions
4. Update documentation strings
Example transformation:
```python
# OLD CODE
@staticmethod
def add_args(parser):
parser.add_argument(
'-g', '--graph-host',
default="localhost",
help=f'Graph host (default: localhost)'
)
parser.add_argument(
'--graph-username',
default=None,
help=f'Cassandra username'
)
# NEW CODE
@staticmethod
def add_args(parser):
FlowProcessor.add_args(parser)
add_cassandra_args(parser) # Use standard helper
```
### Phase 3: Update Modules Using `cassandra_*` Parameters
1. Add command-line argument support where missing (e.g., `kg-store`)
2. Replace existing argument definitions with `add_cassandra_args()`
3. Use `resolve_cassandra_config()` for consistent resolution
4. Ensure consistent host list handling
### Phase 4: Update Tests and Documentation
1. Update all test files to use new parameter names
2. Update CLI documentation
3. Update API documentation
4. Add environment variable documentation
## Backward Compatibility
To maintain backward compatibility during transition:
1. **Deprecation warnings** for `graph_*` parameters
2. **Parameter aliasing** - accept both old and new names initially
3. **Phased rollout** over multiple releases
4. **Documentation updates** with migration guide
Example backward compatibility code:
```python
def __init__(self, **params):
# Handle deprecated graph_* parameters
if 'graph_host' in params:
warnings.warn("graph_host is deprecated, use cassandra_host", DeprecationWarning)
params.setdefault('cassandra_host', params.pop('graph_host'))
if 'graph_username' in params:
warnings.warn("graph_username is deprecated, use cassandra_username", DeprecationWarning)
params.setdefault('cassandra_username', params.pop('graph_username'))
# ... continue with standard resolution
```
## Testing Strategy
1. **Unit tests** for configuration resolution logic
2. **Integration tests** with various configuration combinations
3. **Environment variable tests**
4. **Backward compatibility tests** with deprecated parameters
5. **Docker compose tests** with environment variables
## Documentation Updates
1. Update all CLI command documentation
2. Update API documentation
3. Create migration guide
4. Update Docker compose examples
5. Update configuration reference documentation
## Risks and Mitigation
| Risk | Impact | Mitigation |
|------|--------|------------|
| Breaking changes for users | High | Implement backward compatibility period |
| Configuration confusion during transition | Medium | Clear documentation and deprecation warnings |
| Test failures | Medium | Comprehensive test updates |
| Docker deployment issues | High | Update all Docker compose examples |
## Success Criteria
- [ ] All modules use consistent `cassandra_*` parameter names
- [ ] All processors expose Cassandra settings via command-line arguments
- [ ] Command-line help text shows environment variable defaults
- [ ] Password values are never displayed in help text
- [ ] Environment variable fallback works correctly
- [ ] `cassandra_host` is consistently handled as a list internally
- [ ] Backward compatibility maintained for at least 2 releases
- [ ] All tests pass with new configuration system
- [ ] Documentation fully updated
- [ ] Docker compose examples work with environment variables
## Timeline
- **Week 1:** Implement common configuration helper and update `graph_*` modules
- **Week 2:** Add environment variable support to existing `cassandra_*` modules
- **Week 3:** Update tests and documentation
- **Week 4:** Integration testing and bug fixes
## Future Considerations
- Consider extending this pattern to other database configurations (e.g., Elasticsearch)
- Implement configuration validation and better error messages
- Add support for Cassandra connection pooling configuration
- Consider adding configuration file support (.env files)


@@ -0,0 +1,582 @@
# Tech Spec: Cassandra Knowledge Base Performance Refactor
**Status:** Draft
**Author:** Assistant
**Date:** 2025-09-18
## Overview
This specification addresses performance issues in the TrustGraph Cassandra knowledge base implementation and proposes optimizations for RDF triple storage and querying.
## Current Implementation
### Schema Design
The current implementation uses a single table design in `trustgraph-flow/trustgraph/direct/cassandra_kg.py`:
```sql
CREATE TABLE triples (
collection text,
s text,
p text,
o text,
PRIMARY KEY (collection, s, p, o)
);
```
**Secondary Indexes:**
- `triples_s` ON `s` (subject)
- `triples_p` ON `p` (predicate)
- `triples_o` ON `o` (object)
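For reference, these index definitions correspond to CQL along the following lines (a sketch; the exact statements in `cassandra_kg.py` may differ):
```python
# Recreation of the current secondary indexes (names and columns from the list above)
for name, column in [("triples_s", "s"), ("triples_p", "p"), ("triples_o", "o")]:
    session.execute(f"CREATE INDEX IF NOT EXISTS {name} ON triples ({column})")
```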
### Query Patterns
The current implementation supports 8 distinct query patterns:
1. **get_all(collection, limit=50)** - Retrieve all triples for a collection
```sql
SELECT s, p, o FROM triples WHERE collection = ? LIMIT 50
```
2. **get_s(collection, s, limit=10)** - Query by subject
```sql
SELECT p, o FROM triples WHERE collection = ? AND s = ? LIMIT 10
```
3. **get_p(collection, p, limit=10)** - Query by predicate
```sql
SELECT s, o FROM triples WHERE collection = ? AND p = ? LIMIT 10
```
4. **get_o(collection, o, limit=10)** - Query by object
```sql
SELECT s, p FROM triples WHERE collection = ? AND o = ? LIMIT 10
```
5. **get_sp(collection, s, p, limit=10)** - Query by subject + predicate
```sql
SELECT o FROM triples WHERE collection = ? AND s = ? AND p = ? LIMIT 10
```
6. **get_po(collection, p, o, limit=10)** - Query by predicate + object ⚠️
```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```
7. **get_os(collection, o, s, limit=10)** - Query by object + subject ⚠️
```sql
SELECT p FROM triples WHERE collection = ? AND o = ? AND s = ? LIMIT 10 ALLOW FILTERING
```
8. **get_spo(collection, s, p, o, limit=10)** - Exact triple match
```sql
SELECT s as x FROM triples WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT 10
```
### Current Architecture
**File: `trustgraph-flow/trustgraph/direct/cassandra_kg.py`**
- Single `KnowledgeGraph` class handling all operations
- Connection pooling through global `_active_clusters` list
- Fixed table name: `"triples"`
- Keyspace per user model
- SimpleStrategy replication with factor 1
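The keyspace-per-user model described above amounts to something like the following sketch (the keyspace name derivation is an assumption):
```python
# Keyspace per user, SimpleStrategy with replication factor 1
session.execute(
    f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace(keyspace)
```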
**Integration Points:**
- **Write Path:** `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- **Query Path:** `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`
- **Knowledge Store:** `trustgraph-flow/trustgraph/tables/knowledge.py`
## Performance Issues Identified
### Schema-Level Issues
1. **Inefficient Primary Key Design**
- Current: `PRIMARY KEY (collection, s, p, o)`
- Results in poor clustering for common access patterns
- Forces expensive secondary index usage
2. **Secondary Index Overuse** ⚠️
- Three secondary indexes on high-cardinality columns (s, p, o)
- Secondary indexes in Cassandra are expensive and don't scale well
- Queries 6 & 7 require `ALLOW FILTERING` indicating poor data modeling
3. **Hot Partition Risk**
- Single partition key `collection` can create hot partitions
- Large collections will concentrate on single nodes
- No distribution strategy for load balancing
### Query-Level Issues
1. **ALLOW FILTERING Usage** ⚠️
- Two query types (get_po, get_os) require `ALLOW FILTERING`
- These queries scan multiple partitions and are extremely expensive
- Performance degrades linearly with data size
2. **Inefficient Access Patterns**
- No optimization for common RDF query patterns
- Missing compound indexes for frequent query combinations
- No consideration for graph traversal patterns
3. **Lack of Query Optimization**
- No prepared statements caching
- No query hints or optimization strategies
- No consideration for pagination beyond simple LIMIT
## Problem Statement
The current Cassandra knowledge base implementation has two critical performance bottlenecks:
### 1. Inefficient get_po Query Performance
The `get_po(collection, p, o)` query is extremely inefficient due to requiring `ALLOW FILTERING`:
```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```
**Why this is problematic:**
- `ALLOW FILTERING` forces Cassandra to scan all partitions within the collection
- Performance degrades linearly with data size
- This is a common RDF query pattern (finding subjects that have a specific predicate-object relationship)
- Creates significant load on the cluster as data grows
### 2. Poor Clustering Strategy
The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clustering benefits:
**Issues with current clustering:**
- `collection` as partition key doesn't distribute data effectively
- Most collections contain diverse data making clustering ineffective
- No consideration for common access patterns in RDF queries
- Large collections create hot partitions on single nodes
- Clustering columns (s, p, o) don't optimize for typical graph traversal patterns
**Impact:**
- Queries don't benefit from data locality
- Poor cache utilization
- Uneven load distribution across cluster nodes
- Scalability bottlenecks as collections grow
## Proposed Solution: Multi-Table Denormalization Strategy
### Overview
Replace the single `triples` table with three purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and ALLOW FILTERING while providing optimal performance for all query types.
### New Schema Design
**Table 1: Subject-Centric Queries**
```sql
CREATE TABLE triples_by_subject (
collection text,
s text,
p text,
o text,
PRIMARY KEY ((collection, s), p, o)
);
```
- **Optimizes:** get_all, get_s, get_sp, get_spo
- **Partition Key:** (collection, s) - Better distribution than collection alone
- **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject
**Table 2: Predicate-Object Queries**
```sql
CREATE TABLE triples_by_po (
collection text,
p text,
o text,
s text,
PRIMARY KEY ((collection, p), o, s)
);
```
- **Optimizes:** get_p, get_po (eliminates ALLOW FILTERING!)
- **Partition Key:** (collection, p) - Direct access by predicate
- **Clustering:** (o, s) - Efficient object-subject traversal
**Table 3: Object-Centric Queries**
```sql
CREATE TABLE triples_by_object (
collection text,
o text,
s text,
p text,
PRIMARY KEY ((collection, o), s, p)
);
```
- **Optimizes:** get_o, get_os
- **Partition Key:** (collection, o) - Direct access by object
- **Clustering:** (s, p) - Efficient subject-predicate traversal
### Query Mapping
| Original Query | Target Table | Performance Improvement |
|----------------|-------------|------------------------|
| get_all(collection) | triples_by_subject | Token-based pagination |
| get_s(collection, s) | triples_by_subject | Direct partition access |
| get_p(collection, p) | triples_by_po | Direct partition access |
| get_o(collection, o) | triples_by_object | Direct partition access |
| get_sp(collection, s, p) | triples_by_subject | Partition + clustering |
| get_po(collection, p, o) | triples_by_po | **No more ALLOW FILTERING!** |
| get_os(collection, o, s) | triples_by_object | Partition + clustering |
| get_spo(collection, s, p, o) | triples_by_subject | Exact key lookup |
### Benefits
1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path
2. **No Secondary Indexes** - Each table IS the index for its query pattern
3. **Better Data Distribution** - Composite partition keys spread load effectively
4. **Predictable Performance** - Query time proportional to result size, not total data
5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture
## Implementation Plan
### Files Requiring Changes
#### Primary Implementation File
**`trustgraph-flow/trustgraph/direct/cassandra_kg.py`** - Complete rewrite required
**Current Methods to Refactor:**
```python
# Schema initialization
def init(self) -> None # Replace single table with three tables
# Insert operations
def insert(self, collection, s, p, o) -> None # Write to all three tables
# Query operations (API unchanged, implementation optimized)
def get_all(self, collection, limit=50) # Use triples_by_subject
def get_s(self, collection, s, limit=10) # Use triples_by_subject
def get_p(self, collection, p, limit=10) # Use triples_by_po
def get_o(self, collection, o, limit=10) # Use triples_by_object
def get_sp(self, collection, s, p, limit=10) # Use triples_by_subject
def get_po(self, collection, p, o, limit=10) # Use triples_by_po (NO ALLOW FILTERING!)
def get_os(self, collection, o, s, limit=10) # Use triples_by_object
def get_spo(self, collection, s, p, o, limit=10) # Use triples_by_subject
# Collection management
def delete_collection(self, collection) -> None # Delete from all three tables
```
#### Integration Files (No Logic Changes Required)
**`trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`**
- No changes needed - uses existing KnowledgeGraph API
- Benefits automatically from performance improvements
**`trustgraph-flow/trustgraph/query/triples/cassandra/service.py`**
- No changes needed - uses existing KnowledgeGraph API
- Benefits automatically from performance improvements
### Test Files Requiring Updates
#### Unit Tests
**`tests/unit/test_storage/test_triples_cassandra_storage.py`**
- Update test expectations for schema changes
- Add tests for multi-table consistency
- Verify no ALLOW FILTERING in query plans
**`tests/unit/test_query/test_triples_cassandra_query.py`**
- Update performance assertions
- Test all 8 query patterns against new tables
- Verify query routing to correct tables
#### Integration Tests
**`tests/integration/test_cassandra_integration.py`**
- End-to-end testing with new schema
- Performance benchmarking comparisons
- Data consistency verification across tables
**`tests/unit/test_storage/test_cassandra_config_integration.py`**
- Update schema validation tests
- Test migration scenarios
### Implementation Strategy
#### Phase 1: Schema and Core Methods
1. **Rewrite `init()` method** - Create three tables instead of one
2. **Rewrite `insert()` method** - Batch writes to all three tables
3. **Implement prepared statements** - For optimal performance
4. **Add table routing logic** - Direct queries to optimal tables
#### Phase 2: Query Method Optimization
1. **Rewrite each get_* method** to use optimal table
2. **Remove all ALLOW FILTERING** usage
3. **Implement efficient clustering key usage**
4. **Add query performance logging**
#### Phase 3: Collection Management
1. **Update `delete_collection()`** - Remove from all three tables
2. **Add consistency verification** - Ensure all tables stay in sync
3. **Implement batch operations** - For atomic multi-table operations
### Key Implementation Details
#### Batch Write Strategy
```python
from cassandra.query import BatchStatement, SimpleStatement

def insert(self, collection, s, p, o):
    batch = BatchStatement()
    # Insert into all three tables; simple statements use %s-style placeholders
    # in the Python driver (prepared statements, shown below, use ?)
    batch.add(SimpleStatement(
        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (%s, %s, %s, %s)"
    ), (collection, s, p, o))
    batch.add(SimpleStatement(
        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (%s, %s, %s, %s)"
    ), (collection, p, o, s))
    batch.add(SimpleStatement(
        "INSERT INTO triples_by_object (collection, o, s, p) VALUES (%s, %s, %s, %s)"
    ), (collection, o, s, p))
    self.session.execute(batch)
```
#### Query Routing Logic
```python
def get_po(self, collection, p, o, limit=10):
    # Route to triples_by_po table - NO ALLOW FILTERING!
    # (%s placeholders: this is a simple statement, not a prepared one)
    return self.session.execute(
        "SELECT s FROM triples_by_po WHERE collection = %s AND p = %s AND o = %s LIMIT %s",
        (collection, p, o, limit)
    )
```
#### Prepared Statement Optimization
```python
def prepare_statements(self):
# Cache prepared statements for better performance
self.insert_subject_stmt = self.session.prepare(
"INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
)
self.insert_po_stmt = self.session.prepare(
"INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
)
# ... etc for all tables and queries
```
## Migration Strategy
### Data Migration Approach
#### Option 1: Blue-Green Deployment (Recommended)
1. **Deploy new schema alongside existing** - Use different table names temporarily
2. **Dual-write period** - Write to both old and new schemas during transition
3. **Background migration** - Copy existing data to new tables
4. **Switch reads** - Route queries to new tables once data is migrated
5. **Drop old tables** - After verification period
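A dual-write shim for step 2 might look like this (a sketch; class and method names are illustrative, assuming both schemas are deployed side by side):
```python
class DualWriteKnowledgeGraph:
    """Writes every triple to both schemas during the blue-green transition window."""

    def __init__(self, legacy_kg, optimized_kg):
        self.legacy_kg = legacy_kg          # old single 'triples' table
        self.optimized_kg = optimized_kg    # new denormalised tables

    def insert(self, collection, s, p, o):
        self.legacy_kg.insert(collection, s, p, o)
        self.optimized_kg.insert(collection, s, p, o)
```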
#### Option 2: In-Place Migration
1. **Schema addition** - Create new tables in existing keyspace
2. **Data migration script** - Batch copy from old table to new tables
3. **Application update** - Deploy new code after migration completes
4. **Old table cleanup** - Remove old table and indexes
### Backward Compatibility
#### Deployment Strategy
```python
# Environment variable to control table usage during migration
USE_LEGACY_TABLES = os.getenv('CASSANDRA_USE_LEGACY', 'false').lower() == 'true'
class KnowledgeGraph:
def __init__(self, ...):
if USE_LEGACY_TABLES:
self.init_legacy_schema()
else:
self.init_optimized_schema()
```
#### Migration Script
```python
def migrate_data():
# Read from old table
old_triples = session.execute("SELECT collection, s, p, o FROM triples")
# Batch write to new tables
    for batch in batched(old_triples, 100):  # e.g. itertools.batched (Python 3.12+) or an equivalent chunking helper
batch_stmt = BatchStatement()
for row in batch:
# Add to all three new tables
batch_stmt.add(insert_subject_stmt, row)
batch_stmt.add(insert_po_stmt, (row.collection, row.p, row.o, row.s))
batch_stmt.add(insert_object_stmt, (row.collection, row.o, row.s, row.p))
session.execute(batch_stmt)
```
### Validation Strategy
#### Data Consistency Checks
```python
def validate_migration():
    # Count total records in old vs new tables for the collection under test
    old_count = session.execute(
        "SELECT COUNT(*) FROM triples WHERE collection = %s", (collection,)
    ).one()[0]
    # The new table's partition key is (collection, s), so a count restricted on
    # collection alone is a one-off validation scan that needs ALLOW FILTERING
    new_count = session.execute(
        "SELECT COUNT(*) FROM triples_by_subject WHERE collection = %s ALLOW FILTERING",
        (collection,)
    ).one()[0]
    assert old_count == new_count, f"Record count mismatch: {old_count} vs {new_count}"
    # Spot check random samples
    sample_queries = generate_test_queries()
    for query in sample_queries:
        old_result = execute_legacy_query(query)
        new_result = execute_optimized_query(query)
        assert old_result == new_result, f"Query results differ for {query}"
```
## Testing Strategy
### Performance Testing
#### Benchmark Scenarios
1. **Query Performance Comparison**
- Before/after performance metrics for all 8 query types
- Focus on get_po performance improvement (eliminate ALLOW FILTERING)
- Measure query latency under various data sizes
2. **Load Testing**
- Concurrent query execution
- Write throughput with batch operations
- Memory and CPU utilization
3. **Scalability Testing**
- Performance with increasing collection sizes
- Multi-collection query distribution
- Cluster node utilization
#### Test Data Sets
- **Small:** 10K triples per collection
- **Medium:** 100K triples per collection
- **Large:** 1M+ triples per collection
- **Multiple collections:** Test partition distribution
### Functional Testing
#### Unit Test Updates
```python
# Example test structure for new implementation
class TestCassandraKGPerformance:
def test_get_po_no_allow_filtering(self):
# Verify get_po queries don't use ALLOW FILTERING
with patch('cassandra.cluster.Session.execute') as mock_execute:
kg.get_po('test_collection', 'predicate', 'object')
executed_query = mock_execute.call_args[0][0]
assert 'ALLOW FILTERING' not in executed_query
def test_multi_table_consistency(self):
# Verify all tables stay in sync
kg.insert('test', 's1', 'p1', 'o1')
# Check all tables contain the triple
assert_triple_exists('triples_by_subject', 'test', 's1', 'p1', 'o1')
assert_triple_exists('triples_by_po', 'test', 'p1', 'o1', 's1')
assert_triple_exists('triples_by_object', 'test', 'o1', 's1', 'p1')
```
#### Integration Test Updates
```python
class TestCassandraIntegration:
def test_query_performance_regression(self):
# Ensure new implementation is faster than old
old_time = benchmark_legacy_get_po()
new_time = benchmark_optimized_get_po()
assert new_time < old_time * 0.5 # At least 50% improvement
def test_end_to_end_workflow(self):
# Test complete write -> query -> delete cycle
# Verify no performance degradation in integration
```
### Rollback Plan
#### Quick Rollback Strategy
1. **Environment variable toggle** - Switch back to legacy tables immediately
2. **Keep legacy tables** - Don't drop until performance is proven
3. **Monitoring alerts** - Automated rollback triggers based on error rates/latency
#### Rollback Validation
```python
def rollback_to_legacy():
# Set environment variable
os.environ['CASSANDRA_USE_LEGACY'] = 'true'
# Restart services to pick up change
restart_cassandra_services()
# Validate functionality
run_smoke_tests()
```
## Risks and Considerations
### Performance Risks
- **Write latency increase** - 3x write operations per insert
- **Storage overhead** - 3x storage requirement
- **Batch write failures** - Need proper error handling
### Operational Risks
- **Migration complexity** - Data migration for large datasets
- **Consistency challenges** - Ensuring all tables stay synchronized
- **Monitoring gaps** - Need new metrics for multi-table operations
### Mitigation Strategies
1. **Gradual rollout** - Start with small collections
2. **Comprehensive monitoring** - Track all performance metrics
3. **Automated validation** - Continuous consistency checking
4. **Quick rollback capability** - Environment-based table selection
## Success Criteria
### Performance Improvements
- [ ] **Eliminate ALLOW FILTERING** - get_po and get_os queries run without filtering
- [ ] **Query latency reduction** - 50%+ improvement in query response times
- [ ] **Better load distribution** - No hot partitions, even load across cluster nodes
- [ ] **Scalable performance** - Query time proportional to result size, not total data
### Functional Requirements
- [ ] **API compatibility** - All existing code continues to work unchanged
- [ ] **Data consistency** - All three tables remain synchronized
- [ ] **Zero data loss** - Migration preserves all existing triples
- [ ] **Backward compatibility** - Ability to rollback to legacy schema
### Operational Requirements
- [ ] **Safe migration** - Blue-green deployment with rollback capability
- [ ] **Monitoring coverage** - Comprehensive metrics for multi-table operations
- [ ] **Test coverage** - All query patterns tested with performance benchmarks
- [ ] **Documentation** - Updated deployment and operational procedures
## Timeline
### Phase 1: Implementation
- [ ] Rewrite `cassandra_kg.py` with multi-table schema
- [ ] Implement batch write operations
- [ ] Add prepared statement optimization
- [ ] Update unit tests
### Phase 2: Integration Testing
- [ ] Update integration tests
- [ ] Performance benchmarking
- [ ] Load testing with realistic data volumes
- [ ] Validation scripts for data consistency
### Phase 3: Migration Planning
- [ ] Blue-green deployment scripts
- [ ] Data migration tools
- [ ] Monitoring dashboard updates
- [ ] Rollback procedures
### Phase 4: Production Deployment
- [ ] Staged rollout to production
- [ ] Performance monitoring and validation
- [ ] Legacy table cleanup
- [ ] Documentation updates
## Conclusion
This multi-table denormalization strategy directly addresses the two critical performance bottlenecks:
1. **Eliminates expensive ALLOW FILTERING** by providing optimal table structures for each query pattern
2. **Improves clustering effectiveness** through composite partition keys that distribute load properly
The approach leverages Cassandra's strengths while maintaining complete API compatibility, ensuring existing code benefits automatically from the performance improvements.


@@ -0,0 +1,349 @@
# Collection Management Technical Specification
## Overview
This specification describes the collection management capabilities for TrustGraph, enabling users to have explicit control over collections that are currently implicitly created during data loading and querying operations. The feature supports four primary use cases:
1. **Collection Listing**: View all existing collections in the system
2. **Collection Deletion**: Remove unwanted collections and their associated data
3. **Collection Labeling**: Associate descriptive labels with collections for better organization
4. **Collection Tagging**: Apply tags to collections for categorization and easier discovery
## Goals
- **Explicit Collection Control**: Provide users with direct management capabilities over collections beyond implicit creation
- **Collection Visibility**: Enable users to list and inspect all collections in their environment
- **Collection Cleanup**: Allow deletion of collections that are no longer needed
- **Collection Organization**: Support labels and tags for better collection tracking and discovery
- **Metadata Management**: Associate meaningful metadata with collections for operational clarity
- **Collection Discovery**: Make it easier to find specific collections through filtering and search
- **Operational Transparency**: Provide clear visibility into collection lifecycle and usage
- **Resource Management**: Enable cleanup of unused collections to optimize resource utilization
## Background
Currently, collections in TrustGraph are implicitly created during data loading operations and query execution. While this provides convenience for users, it lacks the explicit control needed for production environments and long-term data management.
Current limitations include:
- No way to list existing collections
- No mechanism to delete unwanted collections
- No ability to associate metadata with collections for tracking purposes
- Difficulty in organizing and discovering collections over time
This specification addresses these gaps by introducing explicit collection management operations. By providing collection management APIs and commands, TrustGraph can:
- Give users full control over their collection lifecycle
- Enable better organization through labels and tags
- Support collection cleanup for resource optimization
- Improve operational visibility and management
## Technical Design
### Architecture
The collection management system will be implemented within existing TrustGraph infrastructure:
1. **Librarian Service Integration**
- Collection management operations will be added to the existing librarian service
- No new service required - leverages existing authentication and access patterns
- Handles collection listing, deletion, and metadata management
Module: trustgraph-librarian
2. **Cassandra Collection Metadata Table**
- New table in the existing librarian keyspace
- Stores collection metadata with user-scoped access
- Primary key: (user_id, collection_id) for proper multi-tenancy
Module: trustgraph-librarian
3. **Collection Management CLI**
- Command-line interface for collection operations
- Provides list, delete, label, and tag management commands
- Integrates with existing CLI framework
Module: trustgraph-cli
### Data Models
#### Cassandra Collection Metadata Table
The collection metadata will be stored in a structured Cassandra table in the librarian keyspace:
```sql
CREATE TABLE collections (
user text,
collection text,
name text,
description text,
tags set<text>,
created_at timestamp,
updated_at timestamp,
PRIMARY KEY (user, collection)
);
```
Table structure:
- **user** + **collection**: Composite primary key ensuring user isolation
- **name**: Human-readable collection name
- **description**: Detailed description of collection purpose
- **tags**: Set of tags for categorization and filtering
- **created_at**: Collection creation timestamp
- **updated_at**: Last modification timestamp
This approach allows:
- Multi-tenant collection management with user isolation
- Efficient querying by user and collection
- Flexible tagging system for organization
- Lifecycle tracking for operational insights
#### Collection Lifecycle
Collections follow a lazy-creation pattern that aligns with existing TrustGraph behavior:
1. **Lazy Creation**: Collections are automatically created when first referenced during data loading or query operations. No explicit create operation is needed.
2. **Implicit Registration**: When a collection is used (data loading, querying), the system checks if a metadata record exists. If not, a new record is created with default values:
- `name`: defaults to collection_id
- `description`: empty
- `tags`: empty set
- `created_at`: current timestamp
3. **Explicit Updates**: Users can update collection metadata (name, description, tags) through management operations after lazy creation.
4. **Explicit Deletion**: Users can delete collections, which removes both the metadata record and the underlying collection data across all store types.
5. **Multi-Store Deletion**: Collection deletion cascades across all storage backends (vector stores, object stores, triple stores) as each implements lazy creation and must support collection deletion.
Operations required:
- **Collection Use Notification**: Internal operation triggered during data loading/querying to ensure metadata record exists
- **Update Collection Metadata**: User operation to modify name, description, and tags
- **Delete Collection**: User operation to remove collection and its data across all stores
- **List Collections**: User operation to view collections with filtering by tags
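A minimal sketch of the collection-use notification, assuming the `collections` table defined above and a cassandra-driver session (the helper name is illustrative):
```python
def ensure_collection(session, user, collection):
    # Idempotent lazy registration: the LWT insert creates the metadata row only
    # if it does not already exist, using the defaults described above
    session.execute(
        "INSERT INTO collections "
        "(user, collection, name, description, tags, created_at, updated_at) "
        "VALUES (%s, %s, %s, %s, %s, toTimestamp(now()), toTimestamp(now())) "
        "IF NOT EXISTS",
        (user, collection, collection, "", set())
    )
```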
#### Multi-Store Collection Management
Collections exist across multiple storage backends in TrustGraph:
- **Vector Stores**: Store embeddings and vector data for collections
- **Object Stores**: Store documents and file data for collections
- **Triple Stores**: Store graph/RDF data for collections
Each store type implements:
- **Lazy Creation**: Collections are created implicitly when data is first stored
- **Collection Deletion**: Store-specific deletion operations to remove collection data
The librarian service coordinates collection operations across all store types, ensuring consistent collection lifecycle management.
### APIs
New APIs:
- **List Collections**: Retrieve collections for a user with optional tag filtering
- **Update Collection Metadata**: Modify collection name, description, and tags
- **Delete Collection**: Remove collection and associated data with confirmation, cascading to all store types
- **Collection Use Notification** (Internal): Ensure metadata record exists when collection is referenced
Store Writer APIs (Enhanced):
- **Vector Store Collection Deletion**: Remove vector data for specified user and collection
- **Object Store Collection Deletion**: Remove object/document data for specified user and collection
- **Triple Store Collection Deletion**: Remove graph/RDF data for specified user and collection
Modified APIs:
- **Data Loading APIs**: Enhanced to trigger collection use notification for lazy metadata creation
- **Query APIs**: Enhanced to trigger collection use notification and optionally include metadata in responses
### Implementation Details
The implementation will follow existing TrustGraph patterns for service integration and CLI command structure.
#### Collection Deletion Cascade
When a user initiates collection deletion through the librarian service:
1. **Metadata Validation**: Verify collection exists and user has permission to delete
2. **Store Cascade**: Librarian coordinates deletion across all store writers:
- Vector store writer: Remove embeddings and vector indexes for the user and collection
- Object store writer: Remove documents and files for the user and collection
- Triple store writer: Remove graph data and triples for the user and collection
3. **Metadata Cleanup**: Remove collection metadata record from Cassandra
4. **Error Handling**: If any store deletion fails, maintain consistency through rollback or retry mechanisms
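A coordination sketch in the librarian service, assuming one collection-management producer per store type (names are illustrative, not the final API):
```python
async def delete_collection(self, user, collection):
    # 1. Validate the metadata record and the caller's permission (omitted here)
    # 2. Fan the delete-collection request out to each store writer's queue
    request = {"operation": "delete-collection", "user": user, "collection": collection}
    for producer in (self.vector_mgmt, self.object_mgmt, self.triples_mgmt):
        await producer.send(request)
    # 3. Remove the metadata record once every store reports success
    self.table_store.delete_collection_meta(user, collection)
```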
#### Collection Management Interface
All store writers will implement a standardized collection management interface with a common schema across store types:
**Message Schema:**
```json
{
"operation": "delete-collection",
"user": "user123",
"collection": "documents-2024",
"timestamp": "2024-01-15T10:30:00Z"
}
```
**Queue Architecture:**
- **Object Store Collection Management Queue**: Handles collection operations for object/document stores
- **Vector Store Collection Management Queue**: Handles collection operations for vector/embedding stores
- **Triple Store Collection Management Queue**: Handles collection operations for graph/RDF stores
Each store writer implements:
- **Collection Management Handler**: Separate from standard data storage handlers
- **Delete Collection Operation**: Removes all data associated with the specified collection
- **Message Processing**: Consumes from dedicated collection management queue
- **Status Reporting**: Returns success/failure status for coordination
- **Idempotent Operations**: Handles cases where collection doesn't exist (no-op)
**Initial Implementation:**
Only `delete-collection` operation will be implemented initially. The interface supports future operations like `archive-collection`, `migrate-collection`, etc.
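A store writer's collection-management handler could be sketched as follows (field names follow the message schema above; the consumer wiring and store methods are assumptions):
```python
async def on_collection_management(self, message):
    request = message.value()
    if request.operation == "delete-collection":
        # Idempotent: deleting a collection that does not exist is a no-op
        self.store.delete_collection(request.user, request.collection)
        status = "ok"
    else:
        status = "unsupported-operation"
    # Report status back to the librarian for cascade coordination
    await self.send_response(message, status=status)
```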
#### Cassandra Triple Store Refactor
As part of this implementation, the Cassandra triple store will be refactored from a table-per-collection model to a unified table model:
**Current Architecture:**
- Keyspace per user, separate table per collection
- Schema: `(s, p, o)` with `PRIMARY KEY (s, p, o)`
- Table names: user collections become separate Cassandra tables
**New Architecture:**
- Keyspace per user, single "triples" table for all collections
- Schema: `(collection, s, p, o)` with `PRIMARY KEY (collection, s, p, o)`
- Collection isolation through collection partitioning
**Changes Required:**
1. **TrustGraph Class Refactor** (`trustgraph/direct/cassandra.py`):
- Remove `table` parameter from constructor, use fixed "triples" table
- Add `collection` parameter to all methods
- Update schema to include collection as first column
- **Index Updates**: New indexes will be created to support all 8 query patterns:
- Index on `(s)` for subject-based queries
- Index on `(p)` for predicate-based queries
- Index on `(o)` for object-based queries
- Note: Cassandra doesn't support multi-column secondary indexes, so these are single-column indexes
- **Query Pattern Performance**:
- ✅ `get_all()` - partition scan on `collection`
- ✅ `get_s(s)` - uses primary key efficiently (`collection, s`)
- ✅ `get_p(p)` - uses `idx_p` with `collection` filtering
- ✅ `get_o(o)` - uses `idx_o` with `collection` filtering
- ✅ `get_sp(s, p)` - uses primary key efficiently (`collection, s, p`)
- ⚠️ `get_po(p, o)` - requires `ALLOW FILTERING` (uses either `idx_p` or `idx_o` plus filtering)
- ✅ `get_os(o, s)` - uses `idx_o` with additional filtering on `s`
- ✅ `get_spo(s, p, o)` - uses full primary key efficiently
- **Note on ALLOW FILTERING**: The `get_po` query pattern requires `ALLOW FILTERING` as it needs both predicate and object constraints without a suitable compound index. This is acceptable as this query pattern is less common than subject-based queries in typical triple store usage
2. **Storage Writer Updates** (`trustgraph/storage/triples/cassandra/write.py`):
- Maintain single TrustGraph connection per user instead of per (user, collection)
- Pass collection to insert operations
- Improved resource utilization with fewer connections
3. **Query Service Updates** (`trustgraph/query/triples/cassandra/service.py`):
- Single TrustGraph connection per user
- Pass collection to all query operations
- Maintain same query logic with collection parameter
**Benefits:**
- **Simplified Collection Deletion**: Simple `DELETE FROM triples WHERE collection = ?` instead of dropping tables
- **Resource Efficiency**: Fewer database connections and table objects
- **Cross-Collection Operations**: Easier to implement operations spanning multiple collections
- **Consistent Architecture**: Aligns with unified collection metadata approach
**Migration Strategy:**
Existing table-per-collection data will need migration to the new unified schema during the upgrade process.
Collection operations will be atomic where possible and provide appropriate error handling and validation.
## Security Considerations
Collection management operations require appropriate authorization to prevent unauthorized access or deletion of collections. Access control will align with existing TrustGraph security models.
## Performance Considerations
Collection listing operations may need pagination for environments with large numbers of collections. Metadata queries should be optimized for common filtering patterns.
## Testing Strategy
Comprehensive testing will cover collection lifecycle operations, metadata management, and CLI command functionality with both unit and integration tests.
## Migration Plan
This implementation requires both metadata and storage migrations:
### Collection Metadata Migration
Existing collections will need to be registered in the new Cassandra collections metadata table. A migration process will:
- Scan existing keyspaces and tables to identify collections
- Create metadata records with default values (name=collection_id, empty description/tags)
- Preserve creation timestamps where possible
### Cassandra Triple Store Migration
The Cassandra storage refactor requires data migration from table-per-collection to unified table:
- **Pre-migration**: Identify all user keyspaces and collection tables
- **Data Transfer**: Copy triples from individual collection tables to unified "triples" table with collection
- **Schema Validation**: Ensure new primary key structure maintains query performance
- **Cleanup**: Remove old collection tables after successful migration
- **Rollback Plan**: Maintain ability to restore table-per-collection structure if needed
Migration will be performed during a maintenance window to ensure data consistency.
## Implementation Status
### ✅ Completed Components
1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_service.py`)
- Complete collection CRUD operations (list, update, delete)
- Cassandra collection metadata table integration via `LibraryTableStore`
- Async request/response handling with proper error management
- Collection deletion cascade coordination across all storage types
2. **Collection Metadata Schema** (`trustgraph-base/trustgraph/schema/services/collection.py`)
- `CollectionManagementRequest` and `CollectionManagementResponse` schemas
- `CollectionMetadata` schema for collection records
- Collection request/response queue topic definitions
3. **Storage Management Schema** (`trustgraph-base/trustgraph/schema/services/storage.py`)
- `StorageManagementRequest` and `StorageManagementResponse` schemas
- Message format for storage-level collection operations
### ❌ Missing Components
1. **Storage Management Queue Topics**
- Missing topic definitions in schema for:
- `vector_storage_management_topic`
- `object_storage_management_topic`
- `triples_storage_management_topic`
- `storage_management_response_topic`
- These are referenced by the librarian service but not yet defined
2. **Store Collection Management Handlers**
- **Vector Store Writers** (Qdrant, Milvus, Pinecone): No collection deletion handlers
- **Object Store Writers** (Cassandra): No collection deletion handlers
- **Triple Store Writers** (Cassandra, Neo4j, Memgraph, FalkorDB): No collection deletion handlers
- Need to implement `StorageManagementRequest` processing in each store writer
3. **Collection Management Interface Implementation**
- Store writers need collection management message consumers
- Collection deletion operations need to be implemented per store type
- Response handling back to librarian service
### Next Implementation Steps
1. **Define Storage Management Topics** in `trustgraph-base/trustgraph/schema/services/storage.py`
2. **Implement Collection Management Handlers** in each storage writer:
- Add `StorageManagementRequest` consumers
- Implement collection deletion operations
- Add response producers for status reporting
3. **Test End-to-End Collection Deletion** across all storage types
## Timeline
Phase 1 (Storage Topics): 1-2 days
Phase 2 (Store Handlers): 1-2 weeks depending on number of storage backends
Phase 3 (Testing & Integration): 3-5 days
## Open Questions
- Should collection deletion be soft or hard delete by default?
- What metadata fields should be required vs optional?
- Should we implement storage management handlers incrementally by store type?


@@ -0,0 +1,156 @@
# Flow Class Definition Specification
## Overview
A flow class defines a complete dataflow pattern template in the TrustGraph system. When instantiated, it creates an interconnected network of processors that handle data ingestion, processing, storage, and querying as a unified system.
## Structure
A flow class definition consists of four main sections:
### 1. Class Section
Defines shared service processors that are instantiated once per flow class. These processors handle requests from all flow instances of this class.
```json
"class": {
"service-name:{class}": {
"request": "queue-pattern:{class}",
"response": "queue-pattern:{class}"
}
}
```
**Characteristics:**
- Shared across all flow instances of the same class
- Typically expensive or stateless services (LLMs, embedding models)
- Use `{class}` template variable for queue naming
- Examples: `embeddings:{class}`, `text-completion:{class}`, `graph-rag:{class}`
### 2. Flow Section
Defines flow-specific processors that are instantiated for each individual flow instance. Each flow gets its own isolated set of these processors.
```json
"flow": {
"processor-name:{id}": {
"input": "queue-pattern:{id}",
"output": "queue-pattern:{id}"
}
}
```
**Characteristics:**
- Unique instance per flow
- Handle flow-specific data and state
- Use `{id}` template variable for queue naming
- Examples: `chunker:{id}`, `pdf-decoder:{id}`, `kg-extract-relationships:{id}`
### 3. Interfaces Section
Defines the entry points and interaction contracts for the flow. These form the API surface for external systems and internal component communication.
Interfaces can take two forms:
**Fire-and-Forget Pattern** (single queue):
```json
"interfaces": {
"document-load": "persistent://tg/flow/document-load:{id}",
"triples-store": "persistent://tg/flow/triples-store:{id}"
}
```
**Request/Response Pattern** (object with request/response fields):
```json
"interfaces": {
"embeddings": {
"request": "non-persistent://tg/request/embeddings:{class}",
"response": "non-persistent://tg/response/embeddings:{class}"
}
}
```
**Types of Interfaces:**
- **Entry Points**: Where external systems inject data (`document-load`, `agent`)
- **Service Interfaces**: Request/response patterns for services (`embeddings`, `text-completion`)
- **Data Interfaces**: Fire-and-forget data flow connection points (`triples-store`, `entity-contexts-load`)
### 4. Metadata
Additional information about the flow class:
```json
"description": "Human-readable description",
"tags": ["capability-1", "capability-2"]
```
## Template Variables
### {id}
- Replaced with the unique flow instance identifier
- Creates isolated resources for each flow
- Example: `flow-123`, `customer-A-flow`
### {class}
- Replaced with the flow class name
- Creates shared resources across flows of the same class
- Example: `standard-rag`, `enterprise-rag`
## Queue Patterns (Pulsar)
Flow classes use Apache Pulsar for messaging. Queue names follow the Pulsar format:
```
<persistence>://<tenant>/<namespace>/<topic>
```
### Components:
- **persistence**: `persistent` or `non-persistent` (Pulsar persistence mode)
- **tenant**: `tg` for TrustGraph-supplied flow class definitions
- **namespace**: Indicates the messaging pattern
- `flow`: Fire-and-forget services
- `request`: Request portion of request/response services
- `response`: Response portion of request/response services
- **topic**: The specific queue/topic name with template variables
### Persistent Queues
- Pattern: `persistent://tg/flow/<topic>:{id}`
- Used for fire-and-forget services and durable data flow
- Data persists in Pulsar storage across restarts
- Example: `persistent://tg/flow/chunk-load:{id}`
### Non-Persistent Queues
- Pattern: `non-persistent://tg/request/<topic>:{class}` or `non-persistent://tg/response/<topic>:{class}`
- Used for request/response messaging patterns
- Ephemeral, not persisted to disk by Pulsar
- Lower latency, suitable for RPC-style communication
- Example: `non-persistent://tg/request/embeddings:{class}`
## Dataflow Architecture
The flow class creates a unified dataflow where:
1. **Document Processing Pipeline**: Flows from ingestion through transformation to storage
2. **Query Services**: Integrated processors that query the same data stores and services
3. **Shared Services**: Centralized processors that all flows can utilize
4. **Storage Writers**: Persist processed data to appropriate stores
All processors (both `{id}` and `{class}`) work together as a cohesive dataflow graph, not as separate systems.
## Example Flow Instantiation
Given:
- Flow Instance ID: `customer-A-flow`
- Flow Class: `standard-rag`
Template expansions:
- `persistent://tg/flow/chunk-load:{id}` → `persistent://tg/flow/chunk-load:customer-A-flow`
- `non-persistent://tg/request/embeddings:{class}` → `non-persistent://tg/request/embeddings:standard-rag`
This creates:
- Isolated document processing pipeline for `customer-A-flow`
- Shared embedding service for all `standard-rag` flows
- Complete dataflow from document ingestion through querying
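The expansion itself is a simple string substitution; a sketch (the real instantiation logic lives in the flow engine):
```python
def expand_queue(pattern: str, flow_id: str, flow_class: str) -> str:
    # Substitute the {id} and {class} template variables in a queue pattern
    return pattern.replace("{id}", flow_id).replace("{class}", flow_class)

expand_queue("persistent://tg/flow/chunk-load:{id}", "customer-A-flow", "standard-rag")
# -> 'persistent://tg/flow/chunk-load:customer-A-flow'
```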
## Benefits
1. **Resource Efficiency**: Expensive services are shared across flows
2. **Flow Isolation**: Each flow has its own data processing pipeline
3. **Scalability**: Can instantiate multiple flows from the same template
4. **Modularity**: Clear separation between shared and flow-specific components
5. **Unified Architecture**: Query and processing are part of the same dataflow


@@ -0,0 +1,383 @@
# GraphQL Query Technical Specification
## Overview
This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building upon the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.
The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.
## Goals
- **Dynamic Schema Support**: Automatically adapt to schema changes in configuration without service restarts
- **GraphQL Standards Compliance**: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
- **Efficient Cassandra Queries**: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
- **Relationship Resolution**: Support GraphQL field resolvers for relationships between different object types
- **Type Safety**: Ensure type-safe query execution and response generation based on schema definitions
- **Scalable Performance**: Handle concurrent queries efficiently with proper connection pooling and query optimization
- **Request/Response Integration**: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
- **Error Handling**: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues
## Background
The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.
Current limitations that this specification addresses:
- No query interface for the structured data stored in Cassandra
- Inability to leverage GraphQL's powerful query capabilities for structured data
- Missing support for relationship traversal between related objects
- Lack of a standardized query language for structured data access
The GraphQL query service will bridge these gaps by:
- Providing a standard GraphQL interface for querying Cassandra tables
- Dynamically generating GraphQL schemas from TrustGraph configuration
- Efficiently translating GraphQL queries to Cassandra CQL
- Supporting relationship resolution through field resolvers
## Technical Design
### Architecture
The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:
**Module Location**: `trustgraph-flow/trustgraph/query/objects/cassandra/`
**Key Components**:
1. **GraphQL Query Service Processor**
- Extends base FlowProcessor class
- Implements request/response pattern similar to existing query services
- Monitors configuration for schema updates
- Maintains GraphQL schema synchronized with configuration
2. **Dynamic Schema Generator**
- Converts TrustGraph RowSchema definitions to GraphQL types
- Creates GraphQL object types with proper field definitions
- Generates root Query type with collection-based resolvers
- Updates GraphQL schema when configuration changes
3. **Query Executor**
- Parses incoming GraphQL queries using Strawberry library
- Validates queries against current schema
- Executes queries and returns structured responses
- Handles errors gracefully with detailed error messages
4. **Cassandra Query Translator**
- Converts GraphQL selections to CQL queries
- Optimizes queries based on available indexes and partition keys
- Handles filtering, pagination, and sorting
- Manages connection pooling and session lifecycle
5. **Relationship Resolver**
- Implements field resolvers for object relationships
- Performs efficient batch loading to avoid N+1 queries
- Caches resolved relationships within request context
- Supports both forward and reverse relationship traversal
### Configuration Schema Monitoring
The service will register a configuration handler to receive schema updates:
```python
self.register_config_handler(self.on_schema_config)
```
When schemas change:
1. Parse new schema definitions from configuration
2. Regenerate GraphQL types and resolvers
3. Update the executable schema
4. Clear any schema-dependent caches
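A minimal sketch of such a handler follows, assuming a `(config, version)` handler signature, JSON-encoded RowSchema definitions under a "schema" config type, and hypothetical `build_graphql_schema()` / `resolver_cache` helpers (illustrative names, not the actual API):
```python
import json

# Illustrative only: the real handler signature and config layout follow
# the TrustGraph config-handler conventions.
async def on_schema_config(self, config, version):
    # Assume "schema" entries are JSON-encoded RowSchema definitions
    row_schemas = {
        name: json.loads(defn)
        for name, defn in config.get("schema", {}).items()
    }
    # Regenerate GraphQL types/resolvers and swap in the new executable schema
    self.graphql_schema = self.build_graphql_schema(row_schemas)
    # Drop any schema-dependent caches
    self.resolver_cache.clear()
```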
### GraphQL Schema Generation
For each RowSchema in configuration, generate:
1. **GraphQL Object Type**:
- Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
- Mark required fields as non-nullable in GraphQL
- Add field descriptions from schema
2. **Root Query Fields**:
- Collection query (e.g., `customers`, `transactions`)
- Filtering arguments based on indexed fields
- Pagination support (limit, offset)
- Sorting options for sortable fields
3. **Relationship Fields**:
- Identify foreign key relationships from schema
- Create field resolvers for related objects
- Support both single object and list relationships
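As an illustration of the type-generation step, a Strawberry object type could be built dynamically from a simplified field representation along these lines (the real RowSchema structure and required metadata may differ; this is a sketch, not the implementation):
```python
from typing import Optional

import strawberry

# Map TrustGraph field types to Python types Strawberry understands
TYPE_MAP = {"string": str, "integer": int, "float": float, "boolean": bool}

def build_object_type(name: str, fields: list[tuple[str, str, bool]]):
    """Build a Strawberry type from (field_name, field_type, required) tuples."""
    annotations = {}
    for field_name, field_type, required in fields:
        py_type = TYPE_MAP[field_type]
        # Required fields become non-nullable; everything else is Optional
        annotations[field_name] = py_type if required else Optional[py_type]
    cls = type(name, (), {"__annotations__": annotations})
    return strawberry.type(cls)

# e.g. Customer = build_object_type(
#     "Customer", [("customer_id", "string", True), ("name", "string", False)])
```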
### Query Execution Flow
1. **Request Reception**:
- Receive ObjectsQueryRequest from Pulsar
- Extract GraphQL query string and variables
- Identify user and collection context
2. **Query Validation**:
- Parse GraphQL query using Strawberry
- Validate against current schema
- Check field selections and argument types
3. **CQL Generation**:
- Analyze GraphQL selections
- Build CQL query with proper WHERE clauses
- Include collection in partition key
- Apply filters based on GraphQL arguments
4. **Query Execution**:
- Execute CQL query against Cassandra
- Map results to GraphQL response structure
- Resolve any relationship fields
- Format response according to GraphQL spec
5. **Response Delivery**:
- Create ObjectsQueryResponse with results
- Include any execution errors
- Send response via Pulsar with correlation ID
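As an example of the CQL generation step, a simplified translation of GraphQL arguments into a parameterised CQL statement might look like this (table and argument names are hypothetical; column names are assumed to come from schema-validated GraphQL arguments):
```python
def build_cql(table, collection, filters, limit=None):
    """Build a parameterised SELECT for one collection-scoped table."""
    clauses = ["collection = %s"]          # collection is part of the partition key
    params = [collection]
    for column, value in filters.items():  # filters come from GraphQL arguments
        clauses.append(f"{column} = %s")
        params.append(value)
    cql = f"SELECT * FROM {table} WHERE {' AND '.join(clauses)}"
    if limit is not None:
        cql += f" LIMIT {limit}"
    return cql, params

# build_cql("customers", "production_v1", {"status": "active"}, limit=20)
# -> ("SELECT * FROM customers WHERE collection = %s AND status = %s LIMIT 20",
#     ["production_v1", "active"])
```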
### Data Models
> **Note**: An existing StructuredQueryRequest/Response schema exists in `trustgraph-base/trustgraph/schema/services/structured_query.py`. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.
#### Request Schema (ObjectsQueryRequest)
```python
from pulsar.schema import Record, String, Map, Array
class ObjectsQueryRequest(Record):
user = String() # Cassandra keyspace (follows pattern from TriplesQueryRequest)
collection = String() # Data collection identifier (required for partition key)
query = String() # GraphQL query string
variables = Map(String()) # GraphQL variables (consider enhancing to support all JSON types)
operation_name = String() # Operation to execute for multi-operation documents
```
**Rationale for changes from existing StructuredQueryRequest:**
- Added `user` and `collection` fields to match other query services pattern
- These fields are essential for identifying the Cassandra keyspace and collection
- Variables remain as Map(String()) for now but should ideally support all JSON types
#### Response Schema (ObjectsQueryResponse)
```python
from pulsar.schema import Record, String, Array, Map
from ..core.primitives import Error
class GraphQLError(Record):
message = String()
path = Array(String()) # Path to the field that caused the error
extensions = Map(String()) # Additional error metadata
class ObjectsQueryResponse(Record):
error = Error() # System-level error (connection, timeout, etc.)
data = String() # JSON-encoded GraphQL response data
errors = Array(GraphQLError) # GraphQL field-level errors
extensions = Map(String()) # Query metadata (execution time, etc.)
```
**Rationale for changes from existing StructuredQueryResponse:**
- Distinguishes between system errors (`error`) and GraphQL errors (`errors`)
- Uses structured GraphQLError objects instead of string array
- Adds `extensions` field for GraphQL spec compliance
- Keeps data as JSON string for compatibility, though native types would be preferable
### Cassandra Query Optimization
The service will optimize Cassandra queries by:
1. **Respecting Partition Keys**:
- Always include collection in queries
- Use schema-defined primary keys efficiently
- Avoid full table scans
2. **Leveraging Indexes**:
- Use secondary indexes for filtering
- Combine multiple filters when possible
- Warn when queries may be inefficient
3. **Batch Loading**:
- Collect relationship queries
- Execute in batches to reduce round trips
- Cache results within request context
4. **Connection Management**:
- Maintain persistent Cassandra sessions
- Use connection pooling
- Handle reconnection on failures
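As an illustration of the batch-loading approach above, a relationship resolver could collect parent keys and issue a single query per batch rather than one per row (the helper name is hypothetical):
```python
async def resolve_customers(order_rows, fetch_customers_by_ids):
    """Resolve the customer relationship for a batch of order rows.

    fetch_customers_by_ids is assumed to run one CQL query (e.g. an IN clause
    on the primary key) and return {customer_id: customer_row}.
    """
    customer_ids = {row["customer_id"] for row in order_rows}
    customers = await fetch_customers_by_ids(list(customer_ids))
    # One lookup per row from the request-scoped result map, no extra queries
    return [customers.get(row["customer_id"]) for row in order_rows]
```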
### Example GraphQL Queries
#### Simple Collection Query
```graphql
{
customers(status: "active") {
customer_id
name
email
registration_date
}
}
```
#### Query with Relationships
```graphql
{
orders(order_date_gt: "2024-01-01") {
order_id
total_amount
customer {
name
email
}
items {
product_name
quantity
price
}
}
}
```
#### Paginated Query
```graphql
{
products(limit: 20, offset: 40) {
product_id
name
price
category
}
}
```
### Implementation Dependencies
- **Strawberry GraphQL**: For GraphQL schema definition and query execution
- **Cassandra Driver**: For database connectivity (already used in storage module)
- **TrustGraph Base**: For FlowProcessor and schema definitions
- **Configuration System**: For schema monitoring and updates
### Command-Line Interface
The service will provide a CLI command: `kg-query-objects-graphql-cassandra`
Arguments:
- `--cassandra-host`: Cassandra cluster contact point
- `--cassandra-username`: Authentication username
- `--cassandra-password`: Authentication password
- `--config-type`: Configuration type for schemas (default: "schema")
- Standard FlowProcessor arguments (Pulsar configuration, etc.)
## API Integration
### Pulsar Topics
**Input Topic**: `objects-graphql-query-request`
- Schema: ObjectsQueryRequest
- Receives GraphQL queries from gateway services
**Output Topic**: `objects-graphql-query-response`
- Schema: ObjectsQueryResponse
- Returns query results and errors
### Gateway Integration
The gateway and reverse-gateway will need endpoints to:
1. Accept GraphQL queries from clients
2. Forward to the query service via Pulsar
3. Return responses to clients
4. Support GraphQL introspection queries
### Agent Tool Integration
A new agent tool class will enable:
- Natural language to GraphQL query generation
- Direct GraphQL query execution
- Result interpretation and formatting
- Integration with agent decision flows
## Security Considerations
- **Query Depth Limiting**: Prevent deeply nested queries that could cause performance issues
- **Query Complexity Analysis**: Limit query complexity to prevent resource exhaustion
- **Field-Level Permissions**: Future support for field-level access control based on user roles
- **Input Sanitization**: Validate and sanitize all query inputs to prevent injection attacks
- **Rate Limiting**: Implement query rate limiting per user/collection
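For instance, query depth limiting could be enforced before execution with a small AST walk, assuming graphql-core (which Strawberry builds on) is available; fragment spreads are ignored and the depth limit is illustrative:
```python
from graphql import parse
from graphql.language.ast import OperationDefinitionNode

def query_depth(query: str) -> int:
    """Return the maximum selection-set nesting depth of a GraphQL query."""
    def depth(node, level=0):
        selection_set = getattr(node, "selection_set", None)
        if not selection_set or not selection_set.selections:
            return level
        return max(depth(sel, level + 1) for sel in selection_set.selections)

    document = parse(query)
    return max(
        depth(op) for op in document.definitions
        if isinstance(op, OperationDefinitionNode)
    )

# e.g. reject the request if query_depth(request.query) exceeds a configured limit
```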
## Performance Considerations
- **Query Planning**: Analyze queries before execution to optimize CQL generation
- **Result Caching**: Consider caching frequently accessed data at the field resolver level
- **Connection Pooling**: Maintain efficient connection pools to Cassandra
- **Batch Operations**: Combine multiple queries when possible to reduce latency
- **Monitoring**: Track query performance metrics for optimization
## Testing Strategy
### Unit Tests
- Schema generation from RowSchema definitions
- GraphQL query parsing and validation
- CQL query generation logic
- Field resolver implementations
### Contract Tests
- Pulsar message contract compliance
- GraphQL schema validity
- Response format verification
- Error structure validation
### Integration Tests
- End-to-end query execution against test Cassandra instance
- Schema update handling
- Relationship resolution
- Pagination and filtering
- Error scenarios
### Performance Tests
- Query throughput under load
- Response time for various query complexities
- Memory usage with large result sets
- Connection pool efficiency
## Migration Plan
No migration required as this is a new capability. The service will:
1. Read existing schemas from configuration
2. Connect to existing Cassandra tables created by the storage module
3. Start accepting queries immediately upon deployment
## Timeline
- Week 1-2: Core service implementation and schema generation
- Week 3: Query execution and CQL translation
- Week 4: Relationship resolution and optimization
- Week 5: Testing and performance tuning
- Week 6: Gateway integration and documentation
## Open Questions
1. **Schema Evolution**: How should the service handle queries during schema transitions?
- Option: Queue queries during schema updates
- Option: Support multiple schema versions simultaneously
2. **Caching Strategy**: Should query results be cached?
- Consider: Time-based expiration
- Consider: Event-based invalidation
3. **Federation Support**: Should the service support GraphQL federation for combining with other data sources?
- Would enable unified queries across structured and graph data
4. **Subscription Support**: Should the service support GraphQL subscriptions for real-time updates?
- Would require WebSocket support in gateway
5. **Custom Scalars**: Should custom scalar types be supported for domain-specific data types?
- Examples: DateTime, UUID, JSON fields
## References
- Structured Data Technical Specification: `docs/tech-specs/structured-data.md`
- Strawberry GraphQL Documentation: https://strawberry.rocks/
- GraphQL Specification: https://spec.graphql.org/
- Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
- TrustGraph Flow Processor Documentation: Internal documentation

View file

@ -0,0 +1,682 @@
# Import/Export Graceful Shutdown Technical Specification
## Problem Statement
The TrustGraph gateway currently experiences message loss during websocket closure in both import and export operations. This occurs due to race conditions where messages in transit are discarded before reaching their destination (Pulsar queues for imports, websocket clients for exports).
### Import-Side Issues
1. Publisher's asyncio.Queue buffer is not drained on shutdown
2. Websocket closes before ensuring queued messages reach Pulsar
3. No acknowledgment mechanism for successful message delivery
### Export-Side Issues
1. Messages are acknowledged in Pulsar before successful delivery to clients
2. Hard-coded timeouts cause message drops when queues are full
3. No backpressure mechanism for handling slow consumers
4. Multiple buffer points where data can be lost
## Architecture Overview
```
Import Flow:
Client -> Websocket -> TriplesImport -> Publisher -> Pulsar Queue
Export Flow:
Pulsar Queue -> Subscriber -> TriplesExport -> Websocket -> Client
```
## Proposed Fixes
### 1. Publisher Improvements (Import Side)
#### A. Graceful Queue Draining
**File**: `trustgraph-base/trustgraph/base/publisher.py`
```python
import asyncio
import logging
import time

from pulsar.schema import JsonSchema

logger = logging.getLogger(__name__)

class Publisher:
def __init__(self, client, topic, schema=None, max_size=10,
chunking_enabled=True, drain_timeout=5.0):
self.client = client
self.topic = topic
self.schema = schema
self.q = asyncio.Queue(maxsize=max_size)
self.chunking_enabled = chunking_enabled
self.running = True
self.draining = False # New state for graceful shutdown
self.task = None
self.drain_timeout = drain_timeout
async def stop(self):
"""Initiate graceful shutdown with draining"""
self.running = False
self.draining = True
if self.task:
# Wait for run() to complete draining
await self.task
async def run(self):
"""Enhanced run method with integrated draining logic"""
while self.running or self.draining:
try:
producer = self.client.create_producer(
topic=self.topic,
schema=JsonSchema(self.schema),
chunking_enabled=self.chunking_enabled,
)
drain_end_time = None
while self.running or self.draining:
try:
# Start drain timeout when entering drain mode
if self.draining and drain_end_time is None:
drain_end_time = time.time() + self.drain_timeout
logger.info(f"Publisher entering drain mode, timeout={self.drain_timeout}s")
# Check drain timeout
if self.draining and time.time() > drain_end_time:
if not self.q.empty():
logger.warning(f"Drain timeout reached with {self.q.qsize()} messages remaining")
self.draining = False
break
# Calculate wait timeout based on mode
if self.draining:
# Shorter timeout during draining to exit quickly when empty
timeout = min(0.1, drain_end_time - time.time())
else:
# Normal operation timeout
timeout = 0.25
# Get message from queue
id, item = await asyncio.wait_for(
self.q.get(),
timeout=timeout
)
# Send the message (single place for sending)
if id:
producer.send(item, { "id": id })
else:
producer.send(item)
except asyncio.TimeoutError:
# If draining and queue is empty, we're done
if self.draining and self.q.empty():
logger.info("Publisher queue drained successfully")
self.draining = False
break
continue
except asyncio.QueueEmpty:
# If draining and queue is empty, we're done
if self.draining and self.q.empty():
logger.info("Publisher queue drained successfully")
self.draining = False
break
continue
# Flush producer before closing
if producer:
producer.flush()
producer.close()
except Exception as e:
logger.error(f"Exception in publisher: {e}", exc_info=True)
if not self.running and not self.draining:
return
# If handler drops out, sleep a retry
await asyncio.sleep(1)
async def send(self, id, item):
"""Send still works normally - just adds to queue"""
if self.draining:
# Optionally reject new messages during drain
raise RuntimeError("Publisher is shutting down, not accepting new messages")
await self.q.put((id, item))
```
**Key Design Benefits:**
- **Single Send Location**: All `producer.send()` calls happen in one place within the `run()` method
- **Clean State Machine**: Three clear states - running, draining, stopped
- **Timeout Protection**: Won't hang indefinitely during drain
- **Better Observability**: Clear logging of drain progress and state transitions
- **Optional Message Rejection**: Can reject new messages during shutdown phase
#### B. Improved Shutdown Order
**File**: `trustgraph-flow/trustgraph/gateway/dispatch/triples_import.py`
```python
class TriplesImport:
async def destroy(self):
"""Enhanced destroy with proper shutdown order"""
# Step 1: Stop accepting new messages
self.running.stop()
# Step 2: Wait for publisher to drain its queue
logger.info("Draining publisher queue...")
await self.publisher.stop()
# Step 3: Close websocket only after queue is drained
if self.ws:
await self.ws.close()
```
### 2. Subscriber Improvements (Export Side)
#### A. Integrated Draining Pattern
**File**: `trustgraph-base/trustgraph/base/subscriber.py`
```python
import asyncio
import logging
import time
import uuid

import _pulsar
from pulsar.schema import JsonSchema

logger = logging.getLogger(__name__)

class Subscriber:
def __init__(self, client, topic, subscription, consumer_name,
schema=None, max_size=100, metrics=None,
backpressure_strategy="block", drain_timeout=5.0):
# ... existing init ...
self.backpressure_strategy = backpressure_strategy
self.running = True
self.draining = False # New state for graceful shutdown
self.drain_timeout = drain_timeout
self.pending_acks = {} # Track messages awaiting delivery
async def stop(self):
"""Initiate graceful shutdown with draining"""
self.running = False
self.draining = True
if self.task:
# Wait for run() to complete draining
await self.task
async def run(self):
"""Enhanced run method with integrated draining logic"""
while self.running or self.draining:
if self.metrics:
self.metrics.state("stopped")
try:
self.consumer = self.client.subscribe(
topic = self.topic,
subscription_name = self.subscription,
consumer_name = self.consumer_name,
schema = JsonSchema(self.schema),
)
if self.metrics:
self.metrics.state("running")
logger.info("Subscriber running...")
drain_end_time = None
while self.running or self.draining:
# Start drain timeout when entering drain mode
if self.draining and drain_end_time is None:
drain_end_time = time.time() + self.drain_timeout
logger.info(f"Subscriber entering drain mode, timeout={self.drain_timeout}s")
# Stop accepting new messages from Pulsar during drain
self.consumer.pause_message_listener()
# Check drain timeout
if self.draining and time.time() > drain_end_time:
async with self.lock:
total_pending = sum(
q.qsize() for q in
list(self.q.values()) + list(self.full.values())
)
if total_pending > 0:
logger.warning(f"Drain timeout reached with {total_pending} messages in queues")
self.draining = False
break
# Check if we can exit drain mode
if self.draining:
async with self.lock:
all_empty = all(
q.empty() for q in
list(self.q.values()) + list(self.full.values())
)
if all_empty and len(self.pending_acks) == 0:
logger.info("Subscriber queues drained successfully")
self.draining = False
break
# Process messages only if not draining
if not self.draining:
try:
msg = await asyncio.to_thread(
self.consumer.receive,
timeout_millis=250
)
except _pulsar.Timeout:
continue
except Exception as e:
logger.error(f"Exception in subscriber receive: {e}", exc_info=True)
raise e
if self.metrics:
self.metrics.received()
# Process the message
await self._process_message(msg)
else:
# During draining, just wait for queues to empty
await asyncio.sleep(0.1)
except Exception as e:
logger.error(f"Subscriber exception: {e}", exc_info=True)
finally:
# Negative acknowledge any pending messages
for msg in self.pending_acks.values():
self.consumer.negative_acknowledge(msg)
self.pending_acks.clear()
if self.consumer:
self.consumer.unsubscribe()
self.consumer.close()
self.consumer = None
if self.metrics:
self.metrics.state("stopped")
if not self.running and not self.draining:
return
# If handler drops out, sleep a retry
await asyncio.sleep(1)
async def _process_message(self, msg):
"""Process a single message with deferred acknowledgment"""
# Store message for later acknowledgment
msg_id = str(uuid.uuid4())
self.pending_acks[msg_id] = msg
try:
id = msg.properties()["id"]
except:
id = None
value = msg.value()
delivery_success = False
async with self.lock:
# Deliver to specific subscribers
if id in self.q:
delivery_success = await self._deliver_to_queue(
self.q[id], value
)
# Deliver to all subscribers
for q in self.full.values():
if await self._deliver_to_queue(q, value):
delivery_success = True
# Acknowledge only on successful delivery
if delivery_success:
self.consumer.acknowledge(msg)
del self.pending_acks[msg_id]
else:
# Negative acknowledge for retry
self.consumer.negative_acknowledge(msg)
del self.pending_acks[msg_id]
async def _deliver_to_queue(self, queue, value):
"""Deliver message to queue with backpressure handling"""
try:
if self.backpressure_strategy == "block":
# Block until space available (no timeout)
await queue.put(value)
return True
elif self.backpressure_strategy == "drop_oldest":
# Drop oldest message if queue full
if queue.full():
try:
queue.get_nowait()
if self.metrics:
self.metrics.dropped()
except asyncio.QueueEmpty:
pass
await queue.put(value)
return True
elif self.backpressure_strategy == "drop_new":
# Drop new message if queue full
if queue.full():
if self.metrics:
self.metrics.dropped()
return False
await queue.put(value)
return True
except Exception as e:
logger.error(f"Failed to deliver message: {e}")
return False
```
**Key Design Benefits (matching Publisher pattern):**
- **Single Processing Location**: All message processing happens in the `run()` method
- **Clean State Machine**: Three clear states - running, draining, stopped
- **Pause During Drain**: Stops accepting new messages from Pulsar while draining existing queues
- **Timeout Protection**: Won't hang indefinitely during drain
- **Proper Cleanup**: Negative acknowledges any undelivered messages on shutdown
#### B. Export Handler Improvements
**File**: `trustgraph-flow/trustgraph/gateway/dispatch/triples_export.py`
```python
class TriplesExport:
async def destroy(self):
"""Enhanced destroy with graceful shutdown"""
# Step 1: Signal stop to prevent new messages
self.running.stop()
# Step 2: Wait briefly for in-flight messages
await asyncio.sleep(0.5)
# Step 3: Unsubscribe and stop subscriber (triggers queue drain)
if hasattr(self, 'subs'):
await self.subs.unsubscribe_all(self.id)
await self.subs.stop()
# Step 4: Close websocket last
if self.ws and not self.ws.closed:
await self.ws.close()
async def run(self):
"""Enhanced run with better error handling"""
self.subs = Subscriber(
client = self.pulsar_client,
topic = self.queue,
consumer_name = self.consumer,
subscription = self.subscriber,
schema = Triples,
backpressure_strategy = "block" # Configurable
)
await self.subs.start()
self.id = str(uuid.uuid4())
q = await self.subs.subscribe_all(self.id)
consecutive_errors = 0
max_consecutive_errors = 5
while self.running.get():
try:
resp = await asyncio.wait_for(q.get(), timeout=0.5)
await self.ws.send_json(serialize_triples(resp))
consecutive_errors = 0 # Reset on success
except asyncio.TimeoutError:
continue
except queue.Empty:
continue
except Exception as e:
logger.error(f"Exception sending to websocket: {str(e)}")
consecutive_errors += 1
if consecutive_errors >= max_consecutive_errors:
logger.error("Too many consecutive errors, shutting down")
break
# Brief pause before retry
await asyncio.sleep(0.1)
# Graceful cleanup handled in destroy()
```
### 3. Socket-Level Improvements
**File**: `trustgraph-flow/trustgraph/gateway/endpoint/socket.py`
```python
class SocketEndpoint:
async def listener(self, ws, dispatcher, running):
"""Enhanced listener with graceful shutdown"""
async for msg in ws:
if msg.type == WSMsgType.TEXT:
await dispatcher.receive(msg)
continue
elif msg.type == WSMsgType.BINARY:
await dispatcher.receive(msg)
continue
else:
# Graceful shutdown on close
logger.info("Websocket closing, initiating graceful shutdown")
running.stop()
# Allow time for dispatcher cleanup
await asyncio.sleep(1.0)
break
async def handle(self, request):
"""Enhanced handler with better cleanup"""
# ... existing setup code ...
try:
async with asyncio.TaskGroup() as tg:
running = Running()
dispatcher = await self.dispatcher(
ws, running, request.match_info
)
worker_task = tg.create_task(
self.worker(ws, dispatcher, running)
)
lsnr_task = tg.create_task(
self.listener(ws, dispatcher, running)
)
except ExceptionGroup as e:
logger.error("Exception group occurred:", exc_info=True)
# Attempt graceful dispatcher shutdown
try:
await asyncio.wait_for(
dispatcher.destroy(),
timeout=5.0
)
except asyncio.TimeoutError:
logger.warning("Dispatcher shutdown timed out")
except Exception as de:
logger.error(f"Error during dispatcher cleanup: {de}")
except Exception as e:
logger.error(f"Socket exception: {e}", exc_info=True)
finally:
# Ensure dispatcher cleanup
if dispatcher and hasattr(dispatcher, 'destroy'):
try:
await dispatcher.destroy()
except:
pass
# Ensure websocket is closed
if ws and not ws.closed:
await ws.close()
return ws
```
## Configuration Options
Add configuration support for tuning behavior:
```python
# config.py
class GracefulShutdownConfig:
# Publisher settings
PUBLISHER_DRAIN_TIMEOUT = 5.0 # Seconds to wait for queue drain
PUBLISHER_FLUSH_TIMEOUT = 2.0 # Producer flush timeout
# Subscriber settings
SUBSCRIBER_DRAIN_TIMEOUT = 5.0 # Seconds to wait for queue drain
BACKPRESSURE_STRATEGY = "block" # Options: "block", "drop_oldest", "drop_new"
SUBSCRIBER_MAX_QUEUE_SIZE = 100 # Maximum queue size before backpressure
# Socket settings
SHUTDOWN_GRACE_PERIOD = 1.0 # Seconds to wait for graceful shutdown
MAX_CONSECUTIVE_ERRORS = 5 # Maximum errors before forced shutdown
# Monitoring
LOG_QUEUE_STATS = True # Log queue statistics on shutdown
METRICS_ENABLED = True # Enable metrics collection
```
## Testing Strategy
### Unit Tests
```python
async def test_publisher_queue_drain():
"""Verify Publisher drains queue on shutdown"""
publisher = Publisher(...)
# Fill queue with messages
for i in range(10):
await publisher.send(f"id-{i}", {"data": i})
# Stop publisher
await publisher.stop()
# Verify all messages were sent
assert publisher.q.empty()
assert mock_producer.send.call_count == 10
async def test_subscriber_deferred_ack():
"""Verify Subscriber only acks on successful delivery"""
subscriber = Subscriber(..., backpressure_strategy="drop_new")
# Fill queue to capacity
queue = await subscriber.subscribe("test")
for i in range(100):
await queue.put({"data": i})
# Try to add message when full
msg = create_mock_message()
await subscriber._process_message(msg)
# Verify negative acknowledgment
assert msg.negative_acknowledge.called
assert not msg.acknowledge.called
```
### Integration Tests
```python
async def test_import_graceful_shutdown():
"""Test import path handles shutdown gracefully"""
# Setup
import_handler = TriplesImport(...)
await import_handler.start()
# Send messages
messages = []
for i in range(100):
msg = {"metadata": {...}, "triples": [...]}
await import_handler.receive(msg)
messages.append(msg)
# Shutdown while messages in flight
await import_handler.destroy()
# Verify all messages reached Pulsar
received = await pulsar_consumer.receive_all()
assert len(received) == 100
async def test_export_no_message_loss():
"""Test export path doesn't lose acknowledged messages"""
# Setup Pulsar with test messages
for i in range(100):
await pulsar_producer.send({"data": i})
# Start export handler
export_handler = TriplesExport(...)
export_task = asyncio.create_task(export_handler.run())
# Receive some messages
received = []
for _ in range(50):
msg = await websocket.receive()
received.append(msg)
# Force shutdown
await export_handler.destroy()
# Continue receiving until websocket closes
while not websocket.closed:
try:
msg = await websocket.receive()
received.append(msg)
except:
break
# Verify no acknowledged messages were lost
assert len(received) >= 50
```
## Rollout Plan
### Phase 1: Critical Fixes (Week 1)
- Fix Subscriber acknowledgment timing (prevent message loss)
- Add Publisher queue draining
- Deploy to staging environment
### Phase 2: Graceful Shutdown (Week 2)
- Implement shutdown coordination
- Add backpressure strategies
- Performance testing
### Phase 3: Monitoring & Tuning (Week 3)
- Add metrics for queue depths
- Add alerts for message drops
- Tune timeout values based on production data
## Monitoring & Alerts
### Metrics to Track
- `publisher.queue.depth` - Current Publisher queue size
- `publisher.messages.dropped` - Messages lost during shutdown
- `subscriber.messages.negatively_acknowledged` - Failed deliveries
- `websocket.graceful_shutdowns` - Successful graceful shutdowns
- `websocket.forced_shutdowns` - Forced/timeout shutdowns
### Alerts
- Publisher queue depth > 80% capacity
- Any message drops during shutdown
- Subscriber negative acknowledgment rate > 1%
- Shutdown timeout exceeded
## Backwards Compatibility
All changes maintain backwards compatibility:
- Default behavior unchanged without configuration
- Existing deployments continue to function
- Graceful degradation if new features unavailable
## Security Considerations
- No new attack vectors introduced
- Backpressure prevents memory exhaustion attacks
- Configurable limits prevent resource abuse
## Performance Impact
- Minimal overhead during normal operation
- Shutdown may take up to 5 seconds longer (configurable)
- Memory usage bounded by queue size limits
- CPU impact negligible (<1% increase)

View file

@ -0,0 +1,359 @@
# Neo4j User/Collection Isolation Support
## Problem Statement
The Neo4j triples storage and query implementation currently lacks user/collection isolation, which creates a multi-tenancy security issue. All triples are stored in the same graph space without any mechanism to prevent users from accessing other users' data or mixing collections.
Unlike other storage backends in TrustGraph:
- **Cassandra**: Uses separate keyspaces per user and tables per collection
- **Vector stores** (Milvus, Qdrant, Pinecone): Use collection-specific namespaces
- **Neo4j**: Currently shares all data in a single graph (security vulnerability)
## Current Architecture
### Data Model
- **Nodes**: `:Node` label with `uri` property, `:Literal` label with `value` property
- **Relationships**: `:Rel` label with `uri` property
- **Indexes**: `Node.uri`, `Literal.value`, `Rel.uri`
### Message Flow
- `Triples` messages contain `metadata.user` and `metadata.collection` fields
- Storage service receives user/collection info but ignores it
- Query service expects `user` and `collection` in `TriplesQueryRequest` but ignores them
### Current Security Issue
```cypher
// Any user can query any data - no isolation
MATCH (src:Node)-[rel:Rel]->(dest:Node)
RETURN src.uri, rel.uri, dest.uri
```
## Proposed Solution: Property-Based Filtering (Recommended)
### Overview
Add `user` and `collection` properties to all nodes and relationships, then filter all operations by these properties. This approach provides strong isolation while maintaining query flexibility and backwards compatibility.
### Data Model Changes
#### Enhanced Node Structure
```cypher
// Node entities
CREATE (n:Node {
uri: "http://example.com/entity1",
user: "john_doe",
collection: "production_v1"
})
// Literal entities
CREATE (n:Literal {
value: "literal value",
user: "john_doe",
collection: "production_v1"
})
```
#### Enhanced Relationship Structure
```cypher
// Relationships with user/collection properties
CREATE (src)-[:Rel {
uri: "http://example.com/predicate1",
user: "john_doe",
collection: "production_v1"
}]->(dest)
```
#### Updated Indexes
```cypher
// Compound indexes for efficient filtering
CREATE INDEX node_user_collection_uri FOR (n:Node) ON (n.user, n.collection, n.uri);
CREATE INDEX literal_user_collection_value FOR (n:Literal) ON (n.user, n.collection, n.value);
CREATE INDEX rel_user_collection_uri FOR ()-[r:Rel]-() ON (r.user, r.collection, r.uri);
// Maintain existing indexes for backwards compatibility (optional)
CREATE INDEX Node_uri FOR (n:Node) ON (n.uri);
CREATE INDEX Literal_value FOR (n:Literal) ON (n.value);
CREATE INDEX Rel_uri FOR ()-[r:Rel]-() ON (r.uri);
```
### Implementation Changes
#### Storage Service (`write.py`)
**Current Code:**
```python
def create_node(self, uri):
summary = self.io.execute_query(
"MERGE (n:Node {uri: $uri})",
uri=uri, database_=self.db,
).summary
```
**Updated Code:**
```python
def create_node(self, uri, user, collection):
summary = self.io.execute_query(
"MERGE (n:Node {uri: $uri, user: $user, collection: $collection})",
uri=uri, user=user, collection=collection, database_=self.db,
).summary
```
**Enhanced store_triples Method:**
```python
async def store_triples(self, message):
user = message.metadata.user
collection = message.metadata.collection
for t in message.triples:
self.create_node(t.s.value, user, collection)
if t.o.is_uri:
self.create_node(t.o.value, user, collection)
self.relate_node(t.s.value, t.p.value, t.o.value, user, collection)
else:
self.create_literal(t.o.value, user, collection)
self.relate_literal(t.s.value, t.p.value, t.o.value, user, collection)
```
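The remaining helpers (`relate_node`, `create_literal`, `relate_literal`) would follow the same pattern as `create_node`; a sketch of `relate_node` with the added properties, assuming the same `execute_query` usage as the surrounding code:
```python
def relate_node(self, src, uri, dest, user, collection):
    self.io.execute_query(
        "MATCH (src:Node {uri: $src, user: $user, collection: $collection}) "
        "MATCH (dest:Node {uri: $dest, user: $user, collection: $collection}) "
        "MERGE (src)-[:Rel {uri: $uri, user: $user, collection: $collection}]->(dest)",
        src=src, dest=dest, uri=uri,
        user=user, collection=collection, database_=self.db,
    )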
#### Query Service (`service.py`)
**Current Code:**
```python
records, summary, keys = self.io.execute_query(
"MATCH (src:Node {uri: $src})-[rel:Rel {uri: $rel}]->(dest:Node) "
"RETURN dest.uri as dest",
src=query.s.value, rel=query.p.value, database_=self.db,
)
```
**Updated Code:**
```python
records, summary, keys = self.io.execute_query(
"MATCH (src:Node {uri: $src, user: $user, collection: $collection})-"
"[rel:Rel {uri: $rel, user: $user, collection: $collection}]->"
"(dest:Node {user: $user, collection: $collection}) "
"RETURN dest.uri as dest",
src=query.s.value, rel=query.p.value,
user=query.user, collection=query.collection,
database_=self.db,
)
```
### Migration Strategy
#### Phase 1: Add Properties to New Data
1. Update storage service to add user/collection properties to new triples
2. Maintain backwards compatibility by not requiring properties in queries
3. Existing data remains accessible but not isolated
#### Phase 2: Migrate Existing Data
```cypher
// Migrate existing nodes (requires default user/collection assignment)
MATCH (n:Node) WHERE n.user IS NULL
SET n.user = 'legacy_user', n.collection = 'default_collection';
MATCH (n:Literal) WHERE n.user IS NULL
SET n.user = 'legacy_user', n.collection = 'default_collection';
MATCH ()-[r:Rel]->() WHERE r.user IS NULL
SET r.user = 'legacy_user', r.collection = 'default_collection';
```
#### Phase 3: Enforce Isolation
1. Update query service to require user/collection filtering
2. Add validation to reject queries without proper user/collection context
3. Remove legacy data access paths
### Security Considerations
#### Query Validation
```python
async def query_triples(self, query):
# Validate user/collection parameters
if not query.user or not query.collection:
raise ValueError("User and collection must be specified")
# All queries must include user/collection filters
# ... rest of implementation
```
#### Preventing Parameter Injection
- Use parameterized queries exclusively
- Validate user/collection values against allowed patterns
- Consider sanitization for Neo4j property name requirements
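A simple allow-list check along these lines could be applied before any query is built (the pattern and length limit are illustrative; the accepted character set is a policy decision):
```python
import re

# Hypothetical allow-list for user/collection identifiers
IDENTIFIER_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")

def validate_identifier(value, field_name):
    """Reject user/collection values that fall outside the allow-list."""
    if not value or not IDENTIFIER_PATTERN.match(value):
        raise ValueError(f"Invalid {field_name}: {value!r}")
    return value

# user = validate_identifier(query.user, "user")
# collection = validate_identifier(query.collection, "collection")
```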
#### Audit Trail
```python
logger.info(f"Query executed - User: {query.user}, Collection: {query.collection}, "
f"Pattern: {query.s}/{query.p}/{query.o}")
```
## Alternative Approaches Considered
### Option 2: Label-Based Isolation
**Approach**: Use dynamic labels like `User_john_Collection_prod`
**Pros:**
- Strong isolation through label filtering
- Efficient query performance with label indexes
- Clear data separation
**Cons:**
- Neo4j has practical limits on number of labels (~1000s)
- Complex label name generation and sanitization
- Difficult to query across collections when needed
**Implementation Example:**
```cypher
CREATE (n:Node:User_john_Collection_prod {uri: "http://example.com/entity"})
MATCH (n:User_john_Collection_prod) WHERE n:Node RETURN n
```
### Option 3: Database-Per-User
**Approach**: Create separate Neo4j databases for each user or user/collection combination
**Pros:**
- Complete data isolation
- No risk of cross-contamination
- Independent scaling per user
**Cons:**
- Resource overhead (each database consumes memory)
- Complex database lifecycle management
- Neo4j Community Edition database limits
- Difficult cross-user analytics
### Option 4: Composite Key Strategy
**Approach**: Prefix all URIs and values with user/collection information
**Pros:**
- Backwards compatible with existing queries
- Simple implementation
- No schema changes required
**Cons:**
- URI pollution affects data semantics
- Less efficient queries (string prefix matching)
- Breaks RDF/semantic web standards
**Implementation Example:**
```python
def make_composite_uri(uri, user, collection):
return f"usr:{user}:col:{collection}:uri:{uri}"
```
## Implementation Plan
### Phase 1: Foundation (Week 1)
1. [ ] Update storage service to accept and store user/collection properties
2. [ ] Add compound indexes for efficient querying
3. [ ] Implement backwards compatibility layer
4. [ ] Create unit tests for new functionality
### Phase 2: Query Updates (Week 2)
1. [ ] Update all query patterns to include user/collection filters
2. [ ] Add query validation and security checks
3. [ ] Update integration tests
4. [ ] Performance testing with filtered queries
### Phase 3: Migration & Deployment (Week 3)
1. [ ] Create data migration scripts for existing Neo4j instances
2. [ ] Deployment documentation and runbooks
3. [ ] Monitoring and alerting for isolation violations
4. [ ] End-to-end testing with multiple users/collections
### Phase 4: Hardening (Week 4)
1. [ ] Remove legacy compatibility mode
2. [ ] Add comprehensive audit logging
3. [ ] Security review and penetration testing
4. [ ] Performance optimization
## Testing Strategy
### Unit Tests
```python
def test_user_collection_isolation():
# Store triples for user1/collection1
processor.store_triples(triples_user1_coll1)
# Store triples for user2/collection2
processor.store_triples(triples_user2_coll2)
# Query as user1 should only return user1's data
results = processor.query_triples(query_user1_coll1)
assert all_results_belong_to_user1_coll1(results)
# Query as user2 should only return user2's data
results = processor.query_triples(query_user2_coll2)
assert all_results_belong_to_user2_coll2(results)
```
### Integration Tests
- Multi-user scenarios with overlapping data
- Cross-collection queries (should fail)
- Migration testing with existing data
- Performance benchmarks with large datasets
### Security Tests
- Attempt to query other users' data
- Cypher injection attacks (analogous to SQL injection) on user/collection parameters
- Verify complete isolation under various query patterns
## Performance Considerations
### Index Strategy
- Compound indexes on `(user, collection, uri)` for optimal filtering
- Consider partial indexes if some collections are much larger
- Monitor index usage and query performance
### Query Optimization
- Use EXPLAIN to verify index usage in filtered queries
- Consider query result caching for frequently accessed data
- Profile memory usage with large numbers of users/collections
### Scalability
- Each user/collection combination creates separate data islands
- Monitor database size and connection pool usage
- Consider horizontal scaling strategies if needed
## Security & Compliance
### Data Isolation Guarantees
- **Physical**: All user data stored with explicit user/collection properties
- **Logical**: All queries filtered by user/collection context
- **Access Control**: Service-level validation prevents unauthorized access
### Audit Requirements
- Log all data access with user/collection context
- Track migration activities and data movements
- Monitor for isolation violation attempts
### Compliance Considerations
- GDPR: Enhanced ability to locate and delete user-specific data
- SOC2: Clear data isolation and access controls
- HIPAA: Strong tenant isolation for healthcare data
## Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Query missing user/collection filter | High | Medium | Mandatory validation, comprehensive testing |
| Performance degradation | Medium | Low | Index optimization, query profiling |
| Migration data corruption | High | Low | Backup strategy, rollback procedures |
| Complex multi-collection queries | Medium | Medium | Document query patterns, provide examples |
## Success Criteria
1. **Security**: Zero cross-user data access in production
2. **Performance**: <10% query performance impact vs unfiltered queries
3. **Migration**: 100% existing data successfully migrated with zero loss
4. **Usability**: All existing query patterns work with user/collection context
5. **Compliance**: Full audit trail of user/collection data access
## Conclusion
The property-based filtering approach provides the best balance of security, performance, and maintainability for adding user/collection isolation to Neo4j. It aligns with TrustGraph's existing multi-tenancy patterns while leveraging Neo4j's strengths in graph querying and indexing.
This solution ensures TrustGraph's Neo4j backend meets the same security standards as other storage backends, preventing data isolation vulnerabilities while maintaining the flexibility and power of graph queries.

View file

@ -0,0 +1,559 @@
# Structured Data Descriptor Specification
## Overview
The Structured Data Descriptor is a JSON-based configuration language that describes how to parse, transform, and import structured data into TrustGraph. It provides a declarative approach to data ingestion, supporting multiple input formats and complex transformation pipelines without requiring custom code.
## Core Concepts
### 1. Format Definition
Describes the input file type and parsing options. Determines which parser to use and how to interpret the source data.
### 2. Field Mappings
Maps source paths to target fields with transformations. Defines how data flows from input sources to output schema fields.
### 3. Transform Pipeline
Chain of data transformations that can be applied to field values, including:
- Data cleaning (trim, normalize)
- Format conversion (date parsing, type casting)
- Calculations (arithmetic, string manipulation)
- Lookups (reference tables, substitutions)
### 4. Validation Rules
Data quality checks applied to ensure data integrity:
- Type validation
- Range checks
- Pattern matching (regex)
- Required field validation
- Custom validation logic
### 5. Global Settings
Configuration that applies across the entire import process:
- Lookup tables for data enrichment
- Global variables and constants
- Output format specifications
- Error handling policies
## Implementation Strategy
The importer implementation follows this pipeline:
1. **Parse Configuration** - Load and validate the JSON descriptor
2. **Initialize Parser** - Load appropriate parser (CSV, XML, JSON, etc.) based on `format.type`
3. **Apply Preprocessing** - Execute global filters and transformations
4. **Process Records** - For each input record:
- Extract data using source paths (JSONPath, XPath, column names)
- Apply field-level transforms in sequence
- Validate results against defined rules
- Apply default values for missing data
5. **Apply Postprocessing** - Execute deduplication, aggregation, etc.
6. **Generate Output** - Produce data in specified target format
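A stripped-down version of the per-record step (step 4) could look like the following, assuming simple registries of transform and validation functions keyed by `type`; the real importer may structure this differently:
```python
def process_record(record, descriptor, transforms, validators):
    """Apply mappings, transforms and validations to one parsed record."""
    output = {}
    for mapping in descriptor["mappings"]:
        value = record.get(mapping["source"])
        for step in mapping.get("transforms", []):
            value = transforms[step["type"]](value, step)   # e.g. trim, to_int, lookup
        for rule in mapping.get("validation", []):
            validators[rule["type"]](value, rule)            # raises on failure
        output[mapping["target_field"]] = value
    return output
```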
## Path Expression Support
Different input formats use appropriate path expression languages:
- **CSV**: Column names or indices (`"column_name"` or `"[2]"`)
- **JSON**: JSONPath syntax (`"$.user.profile.email"`)
- **XML**: XPath expressions (`"//product[@id='123']/price"`)
- **Fixed-width**: Field names from field definitions
## Benefits
- **Single Codebase** - One importer handles multiple input formats
- **User-Friendly** - Non-technical users can create configurations
- **Reusable** - Configurations can be shared and versioned
- **Flexible** - Complex transformations without custom coding
- **Robust** - Built-in validation and comprehensive error handling
- **Maintainable** - Declarative approach reduces implementation complexity
## Language Specification
The Structured Data Descriptor uses a JSON configuration format with the following top-level structure:
```json
{
"version": "1.0",
"metadata": {
"name": "Configuration Name",
"description": "Description of what this config does",
"author": "Author Name",
"created": "2024-01-01T00:00:00Z"
},
"format": { ... },
"globals": { ... },
"preprocessing": [ ... ],
"mappings": [ ... ],
"postprocessing": [ ... ],
"output": { ... }
}
```
### Format Definition
Describes the input data format and parsing options:
```json
{
"format": {
"type": "csv|json|xml|fixed-width|excel|parquet",
"encoding": "utf-8",
"options": {
// Format-specific options
}
}
}
```
#### CSV Format Options
```json
{
"format": {
"type": "csv",
"options": {
"delimiter": ",",
"quote_char": "\"",
"escape_char": "\\",
"skip_rows": 1,
"has_header": true,
"null_values": ["", "NULL", "null", "N/A"]
}
}
}
```
#### JSON Format Options
```json
{
"format": {
"type": "json",
"options": {
"root_path": "$.data",
"array_mode": "records|single",
"flatten": false
}
}
}
```
#### XML Format Options
```json
{
"format": {
"type": "xml",
"options": {
"root_element": "//records/record",
"namespaces": {
"ns": "http://example.com/namespace"
}
}
}
}
```
### Global Settings
Define lookup tables, variables, and global configuration:
```json
{
"globals": {
"variables": {
"current_date": "2024-01-01",
"batch_id": "BATCH_001",
"default_confidence": 0.8
},
"lookup_tables": {
"country_codes": {
"US": "United States",
"UK": "United Kingdom",
"CA": "Canada"
},
"status_mapping": {
"1": "active",
"0": "inactive"
}
},
"constants": {
"source_system": "legacy_crm",
"import_type": "full"
}
}
}
```
### Field Mappings
Define how source data maps to target fields with transformations:
```json
{
"mappings": [
{
"target_field": "person_name",
"source": "$.name",
"transforms": [
{"type": "trim"},
{"type": "title_case"},
{"type": "required"}
],
"validation": [
{"type": "min_length", "value": 2},
{"type": "max_length", "value": 100},
{"type": "pattern", "value": "^[A-Za-z\\s]+$"}
]
},
{
"target_field": "age",
"source": "$.age",
"transforms": [
{"type": "to_int"},
{"type": "default", "value": 0}
],
"validation": [
{"type": "range", "min": 0, "max": 150}
]
},
{
"target_field": "country",
"source": "$.country_code",
"transforms": [
{"type": "lookup", "table": "country_codes"},
{"type": "default", "value": "Unknown"}
]
}
]
}
```
### Transform Types
Available transformation functions:
#### String Transforms
```json
{"type": "trim"},
{"type": "upper"},
{"type": "lower"},
{"type": "title_case"},
{"type": "replace", "pattern": "old", "replacement": "new"},
{"type": "regex_replace", "pattern": "\\d+", "replacement": "XXX"},
{"type": "substring", "start": 0, "end": 10},
{"type": "pad_left", "length": 10, "char": "0"}
```
#### Type Conversions
```json
{"type": "to_string"},
{"type": "to_int"},
{"type": "to_float"},
{"type": "to_bool"},
{"type": "to_date", "format": "YYYY-MM-DD"},
{"type": "parse_json"}
```
#### Data Operations
```json
{"type": "default", "value": "default_value"},
{"type": "lookup", "table": "table_name"},
{"type": "concat", "values": ["field1", " - ", "field2"]},
{"type": "calculate", "expression": "${field1} + ${field2}"},
{"type": "conditional", "condition": "${age} > 18", "true_value": "adult", "false_value": "minor"}
```
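The `calculate` and `conditional` transforms reference other fields through `${...}` placeholders. The substitution step itself is simple; evaluating the resulting expression safely is left to the implementation (this snippet shows only the substitution):
```python
import re

VARIABLE_PATTERN = re.compile(r"\$\{([^}]+)\}")

def substitute_variables(expression, record):
    """Replace ${field} placeholders with values from the current record."""
    return VARIABLE_PATTERN.sub(
        lambda match: str(record.get(match.group(1), "")), expression
    )

# substitute_variables("${field1} + ${field2}", {"field1": 2, "field2": 3})
# -> "2 + 3"
```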
### Validation Rules
Data quality checks with configurable error handling:
#### Basic Validations
```json
{"type": "required"},
{"type": "not_null"},
{"type": "min_length", "value": 5},
{"type": "max_length", "value": 100},
{"type": "range", "min": 0, "max": 1000},
{"type": "pattern", "value": "^[A-Z]{2,3}$"},
{"type": "in_list", "values": ["active", "inactive", "pending"]}
```
#### Custom Validations
```json
{
"type": "custom",
"expression": "${age} >= 18 && ${country} == 'US'",
"message": "Must be 18+ and in US"
},
{
"type": "cross_field",
"fields": ["start_date", "end_date"],
"expression": "${start_date} < ${end_date}",
"message": "Start date must be before end date"
}
```
### Preprocessing and Postprocessing
Global operations applied before/after field mapping:
```json
{
"preprocessing": [
{
"type": "filter",
"condition": "${status} != 'deleted'"
},
{
"type": "sort",
"field": "created_date",
"order": "asc"
}
],
"postprocessing": [
{
"type": "deduplicate",
"key_fields": ["email", "phone"]
},
{
"type": "aggregate",
"group_by": ["country"],
"functions": {
"total_count": {"type": "count"},
"avg_age": {"type": "avg", "field": "age"}
}
}
]
}
```
### Output Configuration
Define how processed data should be output:
```json
{
"output": {
"format": "trustgraph-objects",
"schema_name": "person",
"options": {
"batch_size": 1000,
"confidence": 0.9,
"source_span_field": "raw_text",
"metadata": {
"source": "crm_import",
"version": "1.0"
}
},
"error_handling": {
"on_validation_error": "skip|fail|log",
"on_transform_error": "skip|fail|default",
"max_errors": 100,
"error_output": "errors.json"
}
}
}
```
## Complete Example
```json
{
"version": "1.0",
"metadata": {
"name": "Customer Import from CRM CSV",
"description": "Imports customer data from legacy CRM system",
"author": "Data Team",
"created": "2024-01-01T00:00:00Z"
},
"format": {
"type": "csv",
"encoding": "utf-8",
"options": {
"delimiter": ",",
"has_header": true,
"skip_rows": 1
}
},
"globals": {
"variables": {
"import_date": "2024-01-01",
"default_confidence": 0.85
},
"lookup_tables": {
"country_codes": {
"US": "United States",
"CA": "Canada",
"UK": "United Kingdom"
}
}
},
"preprocessing": [
{
"type": "filter",
"condition": "${status} == 'active'"
}
],
"mappings": [
{
"target_field": "full_name",
"source": "customer_name",
"transforms": [
{"type": "trim"},
{"type": "title_case"}
],
"validation": [
{"type": "required"},
{"type": "min_length", "value": 2}
]
},
{
"target_field": "email",
"source": "email_address",
"transforms": [
{"type": "trim"},
{"type": "lower"}
],
"validation": [
{"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
]
},
{
"target_field": "age",
"source": "age",
"transforms": [
{"type": "to_int"},
{"type": "default", "value": 0}
],
"validation": [
{"type": "range", "min": 0, "max": 120}
]
},
{
"target_field": "country",
"source": "country_code",
"transforms": [
{"type": "lookup", "table": "country_codes"},
{"type": "default", "value": "Unknown"}
]
}
],
"output": {
"format": "trustgraph-objects",
"schema_name": "customer",
"options": {
"confidence": "${default_confidence}",
"batch_size": 500
},
"error_handling": {
"on_validation_error": "log",
"max_errors": 50
}
}
}
```
## LLM Prompt for Descriptor Generation
The following prompt can be used to have an LLM analyze sample data and generate a descriptor configuration:
```
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
The descriptor should follow this specification:
- version: "1.0"
- metadata: Configuration name, description, author, and creation date
- format: Input format type and parsing options
- globals: Variables, lookup tables, and constants
- preprocessing: Filters and transformations applied before mapping
- mappings: Field-by-field mapping from source to target with transformations and validations
- postprocessing: Operations like deduplication or aggregation
- output: Target format and error handling configuration
ANALYZE THE DATA:
1. Identify the format (CSV, JSON, XML, etc.)
2. Detect delimiters, encodings, and structure
3. Find data types for each field
4. Identify patterns and constraints
5. Look for fields that need cleaning or transformation
6. Find relationships between fields
7. Identify lookup opportunities (codes that map to values)
8. Detect required vs optional fields
CREATE THE DESCRIPTOR:
For each field in the sample data:
- Map it to an appropriate target field name
- Add necessary transformations (trim, case conversion, type casting)
- Include appropriate validations (required, patterns, ranges)
- Set defaults for missing values
Include preprocessing if needed:
- Filters to exclude invalid records
- Sorting requirements
Include postprocessing if beneficial:
- Deduplication on key fields
- Aggregation for summary data
Configure output for TrustGraph:
- format: "trustgraph-objects"
- schema_name: Based on the data entity type
- Appropriate error handling
DATA SAMPLE:
[Insert data sample here]
ADDITIONAL CONTEXT (optional):
- Target schema name: [if known]
- Business rules: [any specific requirements]
- Data quality issues to address: [known problems]
Generate a complete, valid Structured Data Descriptor configuration that will properly import this data into TrustGraph. Include comments explaining key decisions.
```
### Example Usage Prompt
````
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.
[Standard instructions from above...]
DATA SAMPLE:
```csv
CustomerID,Name,Email,Age,Country,Status,JoinDate,TotalPurchases
1001,"Smith, John",john.smith@email.com,35,US,1,2023-01-15,5420.50
1002,"doe, jane",JANE.DOE@GMAIL.COM,28,CA,1,2023-03-22,3200.00
1003,"Bob Johnson",bob@,62,UK,0,2022-11-01,0
1004,"Alice Chen","alice.chen@company.org",41,US,1,2023-06-10,8900.25
1005,,invalid-email,25,XX,1,2024-01-01,100
```
ADDITIONAL CONTEXT:
- Target schema name: customer
- Business rules: Email should be valid and lowercase, names should be title case
- Data quality issues: Some emails are invalid, some names are missing, country codes need mapping
````
### Prompt for Analyzing Existing Data Without Sample
```
I need you to help me create a Structured Data Descriptor configuration for importing [data type] data.
The source data has these characteristics:
- Format: [CSV/JSON/XML/etc]
- Fields: [list the fields]
- Data quality issues: [describe any known issues]
- Volume: [approximate number of records]
Requirements:
- [List any specific transformation needs]
- [List any validation requirements]
- [List any business rules]
Please generate a Structured Data Descriptor configuration that will:
1. Parse the input format correctly
2. Clean and standardize the data
3. Validate according to the requirements
4. Handle errors gracefully
5. Output in TrustGraph ExtractedObject format
Focus on making the configuration robust and reusable.
```

View file

@ -114,7 +114,7 @@ The structured data integration requires the following technical components:
Module: trustgraph-flow/trustgraph/storage/objects/cassandra
5. **Structured Query Service**
5. **Structured Query Service** **[COMPLETE]**
- Accepts structured queries in defined formats
- Executes queries against the structured store
- Returns objects matching query criteria

View file

@ -0,0 +1,273 @@
# Structured Data Diagnostic Service Technical Specification
## Overview
This specification describes a new invokable service for diagnosing and analyzing structured data within TrustGraph. The service extracts functionality from the existing `tg-load-structured-data` command-line tool and exposes it as a request/response service, enabling programmatic access to data type detection and descriptor generation capabilities.
The service supports three primary operations:
1. **Data Type Detection**: Analyze a data sample to determine its format (CSV, JSON, or XML)
2. **Descriptor Generation**: Generate a TrustGraph structured data descriptor for a given data sample and type
3. **Combined Diagnosis**: Perform both type detection and descriptor generation in sequence
## Goals
- **Modularize Data Analysis**: Extract data diagnosis logic from CLI into reusable service components
- **Enable Programmatic Access**: Provide API-based access to data analysis capabilities
- **Support Multiple Data Formats**: Handle CSV, JSON, and XML data formats consistently
- **Generate Accurate Descriptors**: Produce structured data descriptors that accurately map source data to TrustGraph schemas
- **Maintain Backward Compatibility**: Ensure existing CLI functionality continues to work
- **Enable Service Composition**: Allow other services to leverage data diagnosis capabilities
- **Improve Testability**: Separate business logic from CLI interface for better testing
- **Support Streaming Analysis**: Enable analysis of data samples without loading entire files
## Background
Currently, the `tg-load-structured-data` command provides comprehensive functionality for analyzing structured data and generating descriptors. However, this functionality is tightly coupled to the CLI interface, limiting its reusability.
Current limitations include:
- Data diagnosis logic embedded in CLI code
- No programmatic access to type detection and descriptor generation
- Difficult to integrate diagnosis capabilities into other services
- Limited ability to compose data analysis workflows
This specification addresses these gaps by creating a dedicated service for structured data diagnosis. By exposing these capabilities as a service, TrustGraph can:
- Enable other services to analyze data programmatically
- Support more complex data processing pipelines
- Facilitate integration with external systems
- Improve maintainability through separation of concerns
## Technical Design
### Architecture
The structured data diagnostic service requires the following technical components:
1. **Diagnostic Service Processor**
- Handles incoming diagnosis requests
- Orchestrates type detection and descriptor generation
- Returns structured responses with diagnosis results
Module: `trustgraph-flow/trustgraph/diagnosis/structured_data/service.py`
2. **Data Type Detector**
- Uses algorithmic detection to identify data format (CSV, JSON, XML)
- Analyzes data structure, delimiters, and syntax patterns
- Returns detected format and confidence scores
Module: `trustgraph-flow/trustgraph/diagnosis/structured_data/type_detector.py`
3. **Descriptor Generator**
- Uses prompt service to generate descriptors
- Invokes format-specific prompts (diagnose-csv, diagnose-json, diagnose-xml)
- Maps data fields to TrustGraph schema fields through prompt responses
Module: `trustgraph-flow/trustgraph/diagnosis/structured_data/descriptor_generator.py`
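For reference, a best-effort version of the algorithmic detection used by the Data Type Detector might look like this; the ordering, fallbacks, and confidence values are illustrative:
```python
import csv
import json
import xml.etree.ElementTree as ET

def detect_type(sample):
    """Return (detected_type, confidence) for a text sample."""
    text = sample.strip()
    try:
        json.loads(text)
        return "json", 0.9
    except ValueError:
        pass
    if text.startswith("<"):
        try:
            ET.fromstring(text)
            return "xml", 0.9
        except ET.ParseError:
            pass
    try:
        csv.Sniffer().sniff(text)   # raises csv.Error if no delimiter is found
        return "csv", 0.7
    except csv.Error:
        pass
    return "unknown", 0.0
```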
### Data Models
#### StructuredDataDiagnosisRequest
Request message for structured data diagnosis operations:
```python
from typing import Any, Dict, Optional

class StructuredDataDiagnosisRequest:
operation: str # "detect-type", "generate-descriptor", or "diagnose"
sample: str # Data sample to analyze (text content)
type: Optional[str] # Data type (csv, json, xml) - required for generate-descriptor
schema_name: Optional[str] # Target schema name for descriptor generation
options: Dict[str, Any] # Additional options (e.g., delimiter for CSV)
```
#### StructuredDataDiagnosisResponse
Response message containing diagnosis results:
```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class StructuredDataDiagnosisResponse:
    operation: str                       # The operation that was performed
    detected_type: Optional[str] = None  # Detected data type (for detect-type/diagnose)
    confidence: Optional[float] = None   # Confidence score for type detection
    descriptor: Optional[Dict] = None    # Generated descriptor (for generate-descriptor/diagnose)
    error: Optional[str] = None          # Error message if operation failed
    metadata: Dict[str, Any] = field(default_factory=dict)  # Additional metadata (e.g., field count, sample records)
```
#### Descriptor Structure
The generated descriptor follows the existing structured data descriptor format:
```json
{
  "format": {
    "type": "csv",
    "encoding": "utf-8",
    "options": {
      "delimiter": ",",
      "has_header": true
    }
  },
  "mappings": [
    {
      "source_field": "customer_id",
      "target_field": "id",
      "transforms": [
        {"type": "trim"}
      ]
    }
  ],
  "output": {
    "schema_name": "customer",
    "options": {
      "batch_size": 1000,
      "confidence": 0.9
    }
  }
}
```
### Service Interface
The service will expose the following operations through the request/response pattern:
1. **Type Detection Operation**
- Input: Data sample
- Processing: Analyze data structure using algorithmic detection
- Output: Detected type with confidence score (a minimal detection sketch follows this list)
2. **Descriptor Generation Operation**
- Input: Data sample, type, target schema name
- Processing:
- Call prompt service with format-specific prompt ID (diagnose-csv, diagnose-json, or diagnose-xml)
- Pass data sample and available schemas to prompt
- Receive generated descriptor from prompt response
- Output: Structured data descriptor
3. **Combined Diagnosis Operation**
- Input: Data sample, optional schema name
- Processing:
- Use algorithmic detection to identify format first
- Select appropriate format-specific prompt based on detected type
- Call prompt service to generate descriptor
- Output: Both detected type and descriptor
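To make the type-detection step concrete, here is a minimal sketch of one plausible algorithmic detector. The heuristics (JSON parse attempt, XML tag check, CSV delimiter sniffing), the function name `detect_type`, and the confidence values are illustrative assumptions rather than part of the service contract.
```python
import csv
import json
from typing import Tuple

def detect_type(sample: str) -> Tuple[str, float]:
    """Best-effort detection of a sample's format; returns (type, confidence)."""
    stripped = sample.strip()

    # JSON: the whole sample parses cleanly
    try:
        json.loads(stripped)
        return "json", 0.95
    except ValueError:
        pass

    # XML: starts with a prolog or opening tag and ends with a closing tag
    if stripped.startswith(("<?xml", "<")) and stripped.endswith(">"):
        return "xml", 0.85

    # CSV: a consistent delimiter can be sniffed from the sample
    try:
        csv.Sniffer().sniff(stripped)
        has_header = csv.Sniffer().has_header(stripped)
        return "csv", 0.9 if has_header else 0.7
    except csv.Error:
        pass

    return "csv", 0.3  # Fall back to the most permissive format with low confidence
```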
### Implementation Details
The service will follow TrustGraph service conventions:
1. **Service Registration**
- Register as `structured-diag` service type
- Use standard request/response topics
- Implement FlowProcessor base class
- Register PromptClientSpec for prompt service interaction
2. **Configuration Management**
- Access schema configurations via config service
- Cache schemas for performance
- Handle configuration updates dynamically
3. **Prompt Integration**
- Use existing prompt service infrastructure
- Call prompt service with format-specific prompt IDs:
- `diagnose-csv`: For CSV data analysis
- `diagnose-json`: For JSON data analysis
- `diagnose-xml`: For XML data analysis
- Prompts are configured in prompt config, not hard-coded in service
- Pass schemas and data samples as prompt variables
- Parse prompt responses to extract descriptors (see the sketch after this list)
4. **Error Handling**
- Validate input data samples
- Provide descriptive error messages
- Handle malformed data gracefully
- Handle prompt service failures
5. **Data Sampling**
- Process configurable sample sizes
- Handle incomplete records appropriately
- Maintain sampling consistency
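To illustrate the prompt-driven generation step (item 3 above), the sketch below assumes an async prompt client exposing a `prompt(id=..., variables=...)` coroutine and prompt responses that carry the descriptor as JSON text; the actual PromptClientSpec interface and variable names may differ.
```python
import json
from typing import Any, Dict, List, Optional

# Prompt IDs as listed above; the prompt client interface is a hypothetical stand-in
PROMPT_IDS = {"csv": "diagnose-csv", "json": "diagnose-json", "xml": "diagnose-xml"}

async def generate_descriptor(
    prompt_client,
    data_type: str,
    sample: str,
    schemas: List[Dict[str, Any]],
    schema_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Invoke the format-specific prompt and parse the returned descriptor."""
    if data_type not in PROMPT_IDS:
        raise ValueError(f"Unsupported data type: {data_type}")

    response = await prompt_client.prompt(
        id=PROMPT_IDS[data_type],
        variables={"sample": sample, "schemas": schemas, "schema_name": schema_name},
    )

    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        raise ValueError(f"Prompt returned a malformed descriptor: {e}")
```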
### API Integration
The service will integrate with existing TrustGraph APIs:
Modified Components:
- `tg-load-structured-data` CLI - Refactored to use the new service for diagnosis operations
- Flow API - Extended to support structured data diagnosis requests
New Service Endpoints:
- `/api/v1/flow/{flow}/diagnose/structured-data` - WebSocket endpoint for diagnosis requests
- `/api/v1/diagnose/structured-data` - REST endpoint for synchronous diagnosis
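For illustration, a synchronous call to the REST endpoint might look like the following; the base URL, payload field names (mirroring StructuredDataDiagnosisRequest), and response fields are assumptions about the eventual wire format.
```python
import requests

BASE_URL = "http://localhost:8088"  # Assumed gateway address

payload = {
    "operation": "diagnose",
    "sample": "customer_id,name,email\n1,Alice,alice@example.com\n2,Bob,bob@example.com",
    "schema_name": "customer",
    "options": {},
}

resp = requests.post(
    f"{BASE_URL}/api/v1/diagnose/structured-data", json=payload, timeout=30
)
resp.raise_for_status()

result = resp.json()
print(result.get("detected_type"), result.get("confidence"))
print(result.get("descriptor"))
```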
### Message Flow
```
Client → Gateway → Structured Diag Service
                       ├─→ Config Service (for schemas)
                       ├─→ Type Detector (algorithmic)
                       ├─→ Prompt Service (diagnose-csv/json/xml)
                       └─→ Descriptor Generator (parses prompt response)
Client ← Gateway ← Structured Diag Service (response)
```
## Security Considerations
- Input validation to prevent injection attacks
- Size limits on data samples to prevent DoS
- Sanitization of generated descriptors
- Access control through existing TrustGraph authentication
## Performance Considerations
- Cache schema definitions to reduce config service calls
- Limit sample sizes to maintain responsive performance
- Use streaming processing for large data samples
- Implement timeout mechanisms for long-running analyses
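A small sketch of how the last two points might be enforced; the sample cap and timeout budget are placeholder values (the actual limits are open questions below), and `service.diagnose` stands in for the real handler.
```python
import asyncio

MAX_SAMPLE_CHARS = 64_000        # Placeholder cap on sample size
DIAGNOSIS_TIMEOUT_SECONDS = 30   # Placeholder per-request time budget

async def diagnose_with_limits(service, request):
    """Clamp the sample size, then bound the end-to-end analysis time."""
    if len(request.sample) > MAX_SAMPLE_CHARS:
        request.sample = request.sample[:MAX_SAMPLE_CHARS]  # Truncate rather than reject

    return await asyncio.wait_for(
        service.diagnose(request),
        timeout=DIAGNOSIS_TIMEOUT_SECONDS,
    )
```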
## Testing Strategy
1. **Unit Tests**
- Type detection for various data formats
- Descriptor generation accuracy
- Error handling scenarios
2. **Integration Tests**
- Service request/response flow
- Schema retrieval and caching
- CLI integration
3. **Performance Tests**
- Large sample processing
- Concurrent request handling
- Memory usage under load
## Migration Plan
1. **Phase 1**: Implement service with core functionality
2. **Phase 2**: Refactor CLI to use service (maintain backward compatibility)
3. **Phase 3**: Add REST API endpoints
4. **Phase 4**: Deprecate embedded CLI logic (with notice period)
## Timeline
- Week 1-2: Implement core service and type detection
- Week 3-4: Add descriptor generation and integration
- Week 5: Testing and documentation
- Week 6: CLI refactoring and migration
## Open Questions
- Should the service support additional data formats (e.g., Parquet, Avro)?
- What should be the maximum sample size for analysis?
- Should diagnosis results be cached for repeated requests?
- How should the service handle multi-schema scenarios?
- Should the prompt IDs be configurable parameters for the service?
## References
- [Structured Data Descriptor Specification](structured-data-descriptor.md)
- [Structured Data Loading Documentation](structured-data.md)
- `tg-load-structured-data` implementation: `trustgraph-cli/trustgraph/cli/load_structured_data.py`

# TrustGraph Tool Group System
## Technical Specification v1.0
### Executive Summary
This specification defines a tool grouping system for TrustGraph agents that allows fine-grained control over which tools are available for specific requests. The system introduces group-based tool filtering through configuration and request-level specification, enabling better security boundaries, resource management, and functional partitioning of agent capabilities.
### 1. Overview
#### 1.1 Problem Statement
Currently, TrustGraph agents have access to all configured tools regardless of request context or security requirements. This creates several challenges:
- **Security Risk**: Sensitive tools (e.g., data modification) are available even for read-only queries
- **Resource Waste**: Complex tools are loaded even when simple queries don't require them
- **Functional Confusion**: Agents may select inappropriate tools when simpler alternatives exist
- **Multi-tenant Isolation**: Different user groups need access to different tool sets
#### 1.2 Solution Overview
The tool group system introduces:
1. **Group Classification**: Tools are tagged with group memberships during configuration
2. **Request-level Filtering**: AgentRequest specifies which tool groups are permitted
3. **Runtime Enforcement**: Agents only have access to tools matching the requested groups
4. **Flexible Grouping**: Tools can belong to multiple groups for complex scenarios
### 2. Schema Changes
#### 2.1 Tool Configuration Schema Enhancement
The existing tool configuration is enhanced with a `group` field:
**Before:**
```json
{
"name": "knowledge-query",
"type": "knowledge-query",
"description": "Query the knowledge graph"
}
```
**After:**
```json
{
"name": "knowledge-query",
"type": "knowledge-query",
"description": "Query the knowledge graph",
"group": ["read-only", "knowledge", "basic"]
}
```
**Group Field Specification:**
- `group`: Array(String) - List of groups this tool belongs to
- **Optional**: Tools without a `group` field belong to the "default" group
- **Multi-membership**: Tools can belong to multiple groups
- **Case-sensitive**: Group names are exact string matches
#### 2.1.2 Tool State Transition Enhancement
Tools can optionally specify state transitions and state-based availability:
```json
{
"name": "knowledge-query",
"type": "knowledge-query",
"description": "Query the knowledge graph",
"group": ["read-only", "knowledge", "basic"],
"state": "analysis",
"available_in_states": ["undefined", "research"]
}
```
**State Field Specification:**
- `state`: String - **Optional** - State to transition to after successful tool execution
- `available_in_states`: Array(String) - **Optional** - States in which this tool is available
- **Default behavior**: Tools without `available_in_states` are available in all states
- **State transition**: Only occurs after successful tool execution
#### 2.2 AgentRequest Schema Enhancement
The `AgentRequest` schema in `trustgraph-base/trustgraph/schema/services/agent.py` is enhanced:
**Current AgentRequest:**
- `question`: String - User query
- `plan`: String - Execution plan (can be removed)
- `state`: String - Agent state
- `history`: Array(AgentStep) - Execution history
**Enhanced AgentRequest:**
- `question`: String - User query
- `state`: String - Agent execution state (now actively used for tool filtering)
- `history`: Array(AgentStep) - Execution history
- `group`: Array(String) - **NEW** - Tool groups allowed for this request
**Schema Changes:**
- **Removed**: The `plan` field, originally intended for tool specification, is no longer needed and is dropped
- **Added**: `group` field for tool group specification
- **Enhanced**: `state` field now controls tool availability during execution
**Field Behaviors:**
**Group Field:**
- **Optional**: If not specified, defaults to ["default"]
- **Intersection**: Only tools matching at least one specified group are available
- **Empty array**: No tools available (agent can only use internal reasoning)
- **Wildcard**: Special group "*" grants access to all tools
**State Field:**
- **Optional**: If not specified, defaults to "undefined"
- **State-based filtering**: Only tools available in current state are eligible
- **Default state**: "undefined" state allows all tools (subject to group filtering)
- **State transitions**: Tools can change state after successful execution
### 3. Custom Group Examples
Organizations can define domain-specific groups:
```json
{
"financial-tools": ["stock-query", "portfolio-analysis"],
"medical-tools": ["diagnosis-assist", "drug-interaction"],
"legal-tools": ["contract-analysis", "case-search"]
}
```
### 4. Implementation Details
#### 4.1 Tool Loading and Filtering
**Configuration Phase:**
1. All tools are loaded from configuration with their group assignments
2. Tools without explicit groups are assigned to "default" group
3. Group membership is validated and stored in tool registry
**Request Processing Phase:**
1. AgentRequest arrives with optional group specification
2. Agent filters available tools based on group intersection
3. Only matching tools are passed to agent execution context
4. Agent operates with filtered tool set throughout request lifecycle
#### 4.2 Tool Filtering Logic
**Combined Group and State Filtering:**
```
For each configured tool:
    tool_groups = tool.group || ["default"]
    tool_states = tool.available_in_states || ["*"]   // Available in all states

For each request:
    requested_groups = request.group || ["default"]
    current_state = request.state || "undefined"

Tool is available if:
    // Group filtering
    (intersection(tool_groups, requested_groups) is not empty OR "*" in requested_groups)
    AND
    // State filtering
    (current_state in tool_states OR "*" in tool_states)
```
**State Transition Logic:**
```
After successful tool execution:
    if tool.state is defined:
        next_request.state = tool.state
    else:
        next_request.state = current_request.state   // No change
```
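The same rules in Python form, as a minimal sketch; the dict-based tool registry and helper names are illustrative, not the actual agent implementation.
```python
from typing import Dict, List, Optional

def filter_tools(
    tools: Dict[str, dict],
    requested_groups: Optional[List[str]],
    state: Optional[str],
) -> Dict[str, dict]:
    """Return only the tools eligible for the requested groups and current state."""
    # A missing group field defaults to ["default"]; an explicit empty list means no tools
    groups = ["default"] if requested_groups is None else requested_groups
    current_state = state or "undefined"

    eligible = {}
    for name, tool in tools.items():
        tool_groups = tool.get("group", ["default"])
        tool_states = tool.get("available_in_states", ["*"])  # Default: all states

        group_ok = "*" in groups or bool(set(tool_groups) & set(groups))
        state_ok = "*" in tool_states or current_state in tool_states
        if group_ok and state_ok:
            eligible[name] = tool
    return eligible

def next_state(tool: dict, current_state: str) -> str:
    """Apply the state transition after a successful tool execution."""
    return tool.get("state", current_state)
```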
#### 4.3 Agent Integration Points
**ReAct Agent:**
- Tool filtering occurs in agent_manager.py during tool registry creation
- Available tools list is filtered by both group and state before plan generation
- State transitions update AgentRequest.state field after successful tool execution
- Next iteration uses updated state for tool filtering
**Confidence-Based Agent:**
- Tool filtering occurs in planner.py during plan generation
- ExecutionStep validation ensures only group+state eligible tools are used
- Flow controller enforces tool availability at runtime
- State transitions managed by Flow Controller between steps
### 5. Configuration Examples
#### 5.1 Tool Configuration with Groups and States
```yaml
tool:
  knowledge-query:
    type: knowledge-query
    name: "Knowledge Graph Query"
    description: "Query the knowledge graph for entities and relationships"
    group: ["read-only", "knowledge", "basic"]
    state: "analysis"
    available_in_states: ["undefined", "research"]

  graph-update:
    type: graph-update
    name: "Graph Update"
    description: "Add or modify entities in the knowledge graph"
    group: ["write", "knowledge", "admin"]
    available_in_states: ["analysis", "modification"]

  text-completion:
    type: text-completion
    name: "Text Completion"
    description: "Generate text using language models"
    group: ["read-only", "text", "basic"]
    state: "undefined"
    # No available_in_states = available in all states

  complex-analysis:
    type: mcp-tool
    name: "Complex Analysis Tool"
    description: "Perform complex data analysis"
    group: ["advanced", "compute", "expensive"]
    state: "results"
    available_in_states: ["analysis"]
    mcp_tool_id: "analysis-server"

  reset-workflow:
    type: mcp-tool
    name: "Reset Workflow"
    description: "Reset to initial state"
    group: ["admin"]
    state: "undefined"
    available_in_states: ["analysis", "results"]
```
#### 5.2 Request Examples with State Workflows
**Initial Research Request:**
```json
{
"question": "What entities are connected to Company X?",
"group": ["read-only", "knowledge"],
"state": "undefined"
}
```
*Available tools: knowledge-query, text-completion*
*After knowledge-query: state → "analysis"*
**Analysis Phase:**
```json
{
"question": "Continue analysis based on previous results",
"group": ["advanced", "compute", "write"],
"state": "analysis"
}
```
*Available tools: complex-analysis, graph-update, reset-workflow*
*After complex-analysis: state → "results"*
**Results Phase:**
```json
{
"question": "What should I do with these results?",
"group": ["admin"],
"state": "results"
}
```
*Available tools: reset-workflow only*
*After reset-workflow: state → "undefined"*
**Workflow Example - Complete Flow:**
1. **Start (undefined)**: Use knowledge-query → transitions to "analysis"
2. **Analysis state**: Use complex-analysis → transitions to "results"
3. **Results state**: Use reset-workflow → transitions back to "undefined"
4. **Back to start**: All initial tools available again
### 6. Security Considerations
#### 6.1 Access Control Integration
**Gateway-Level Filtering:**
- Gateway can enforce group restrictions based on user permissions
- Prevent elevation of privileges through request manipulation
- Audit trail includes requested and granted tool groups
**Example Gateway Logic:**
```
user_permissions = get_user_permissions(request.user_id)
allowed_groups = user_permissions.tool_groups
requested_groups = request.group

# Validate the request doesn't exceed the user's permissions
if not set(requested_groups).issubset(allowed_groups):
    reject_request("Insufficient permissions for requested tool groups")
```
#### 6.2 Audit and Monitoring
**Enhanced Audit Trail:**
- Log requested tool groups and initial state per request
- Track state transitions and tool usage by group membership
- Monitor unauthorized group access attempts and invalid state transitions
- Alert on unusual group usage patterns or suspicious state workflows
### 7. Migration Strategy
#### 7.1 Backward Compatibility
**Phase 1: Additive Changes**
- Add optional `group` field to tool configurations
- Add optional `group` field to AgentRequest schema
- Default behavior: All existing tools belong to "default" group
- Existing requests without group field use "default" group
**Existing Behavior Preserved:**
- Tools without group configuration continue to work (default group)
- Tools without state configuration are available in all states
- Requests without group specification access all tools (default group)
- Requests without state specification use "undefined" state (all tools available)
- No breaking changes to existing deployments
### 8. Monitoring and Observability
#### 8.1 New Metrics
**Tool Group Usage:**
- `agent_tool_group_requests_total` - Counter of requests by group
- `agent_tool_group_availability` - Gauge of tools available per group
- `agent_filtered_tools_count` - Histogram of tool count after group+state filtering
**State Workflow Metrics:**
- `agent_state_transitions_total` - Counter of state transitions by tool
- `agent_workflow_duration_seconds` - Histogram of time spent in each state
- `agent_state_availability` - Gauge of tools available per state
**Security Metrics:**
- `agent_group_access_denied_total` - Counter of unauthorized group access
- `agent_invalid_state_transition_total` - Counter of invalid state transitions
- `agent_privilege_escalation_attempts_total` - Counter of suspicious requests
#### 8.2 Logging Enhancements
**Request Logging:**
```json
{
"request_id": "req-123",
"requested_groups": ["read-only", "knowledge"],
"initial_state": "undefined",
"state_transitions": [
{"tool": "knowledge-query", "from": "undefined", "to": "analysis", "timestamp": "2024-01-01T10:00:01Z"}
],
"available_tools": ["knowledge-query", "text-completion"],
"filtered_by_group": ["graph-update", "admin-tool"],
"filtered_by_state": [],
"execution_time": "1.2s"
}
```
### 9. Testing Strategy
#### 9.1 Unit Tests
**Tool Filtering Logic:**
- Test group intersection calculations
- Test state-based filtering logic
- Verify default group and state assignment
- Test wildcard group behavior
- Validate empty group handling
- Test combined group+state filtering scenarios
**Configuration Validation:**
- Test tool loading with various group and state configurations
- Verify schema validation for invalid group and state specifications
- Test backward compatibility with existing configurations
- Validate state transition definitions and cycles
#### 9.2 Integration Tests
**Agent Behavior:**
- Verify agents only see group+state filtered tools
- Test request execution with various group combinations
- Test state transitions during agent execution
- Validate error handling when no tools are available
- Test workflow progression through multiple states
**Security Testing:**
- Test privilege escalation prevention
- Verify audit trail accuracy
- Test gateway integration with user permissions
#### 9.3 End-to-End Scenarios
**Multi-tenant Usage with State Workflows:**
```
Scenario: Different users with different tool access and workflow states
Given: User A has "read-only" permissions, state "undefined"
And: User B has "write" permissions, state "analysis"
When: Both request knowledge operations
Then: User A gets read-only tools available in "undefined" state
And: User B gets write tools available in "analysis" state
And: State transitions are tracked per user session
And: All usage and transitions are properly audited
```
**Workflow State Progression:**
```
Scenario: Complete workflow execution
Given: Request with groups ["knowledge", "compute"] and state "undefined"
When: Agent executes knowledge-query tool (transitions to "analysis")
And: Agent executes complex-analysis tool (transitions to "results")
And: Agent executes reset-workflow tool (transitions to "undefined")
Then: Each step has correctly filtered available tools
And: State transitions are logged with timestamps
And: Final state allows initial workflow to repeat
```
### 10. Performance Considerations
#### 10.1 Tool Loading Impact
**Configuration Loading:**
- Group and state metadata loaded once at startup
- Minimal memory overhead per tool (additional fields)
- No impact on tool initialization time
**Request Processing:**
- Combined group+state filtering occurs once per request
- O(n) complexity where n = number of configured tools
- State transitions add minimal overhead (string assignment)
- Negligible impact for typical tool counts (< 100)
#### 10.2 Optimization Strategies
**Pre-computed Tool Sets:**
- Cache tool sets by group+state combination (sketched below)
- Avoid repeated filtering for common group/state patterns
- Memory vs computation tradeoff for frequently used combinations
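One way to realise the pre-computed tool sets, building on the `filter_tools` sketch in section 4.2; the module-level registry and cache size are assumptions, and the cache would need clearing whenever the tool configuration is reloaded.
```python
from functools import lru_cache
from typing import FrozenSet, Tuple

TOOL_REGISTRY: dict = {}  # Populated from configuration at startup (illustrative)

@lru_cache(maxsize=256)
def cached_tool_names(groups: FrozenSet[str], state: str) -> Tuple[str, ...]:
    """Cache eligible tool names per (group set, state) combination."""
    eligible = filter_tools(TOOL_REGISTRY, sorted(groups), state)
    return tuple(sorted(eligible))

# Usage: freeze the request's groups so the combination is hashable
# names = cached_tool_names(frozenset(request.group), request.state)
# On configuration reload: cached_tool_names.cache_clear()
```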
**Lazy Loading:**
- Load tool implementations only when needed
- Reduce startup time for deployments with many tools
- Dynamic tool registration based on group requirements
### 11. Future Enhancements
#### 11.1 Dynamic Group Assignment
**Context-Aware Grouping:**
- Assign tools to groups based on request context
- Time-based group availability (business hours only)
- Load-based group restrictions (expensive tools during low usage)
#### 11.2 Group Hierarchies
**Nested Group Structure:**
```json
{
"knowledge": {
"read": ["knowledge-query", "entity-search"],
"write": ["graph-update", "entity-create"]
}
}
```
#### 11.3 Tool Recommendations
**Group-Based Suggestions:**
- Suggest optimal tool groups for request types
- Learn from usage patterns to improve recommendations
- Provide fallback groups when preferred tools are unavailable
### 12. Open Questions
1. **Group Validation**: Should invalid group names in requests cause hard failures or warnings?
2. **Group Discovery**: Should the system provide an API to list available groups and their tools?
3. **Dynamic Groups**: Should groups be configurable at runtime or only at startup?
4. **Group Inheritance**: Should tools inherit groups from their parent categories or implementations?
5. **Performance Monitoring**: What additional metrics are needed to track group-based tool usage effectively?
### 13. Conclusion
The tool group system provides:
- **Security**: Fine-grained access control over agent capabilities
- **Performance**: Reduced tool loading and selection overhead
- **Flexibility**: Multi-dimensional tool classification
- **Compatibility**: Seamless integration with existing agent architectures
This system enables TrustGraph deployments to better manage tool access, improve security boundaries, and optimize resource usage while maintaining full backward compatibility with existing configurations and requests.