Mirror of https://github.com/trustgraph-ai/trustgraph.git, synced 2026-05-04 21:02:37 +02:00
Parent a8e437fc7f, commit 6c7af8789d
216 changed files with 31360 additions and 1611 deletions
docs/tech-specs/cassandra-consolidation.md (new file, 331 lines)
# Tech Spec: Cassandra Configuration Consolidation

**Status:** Draft
**Author:** Assistant
**Date:** 2024-09-03

## Overview

This specification addresses the inconsistent naming and configuration patterns for Cassandra connection parameters across the TrustGraph codebase. Currently, two different parameter naming schemes exist (`cassandra_*` vs `graph_*`), leading to confusion and maintenance complexity.

## Problem Statement

The codebase currently uses two distinct sets of Cassandra configuration parameters:

1. **Knowledge/Config/Library modules** use:
   - `cassandra_host` (list of hosts)
   - `cassandra_user`
   - `cassandra_password`

2. **Graph/Storage modules** use:
   - `graph_host` (single host, sometimes converted to a list)
   - `graph_username`
   - `graph_password`

3. **Inconsistent command-line exposure**:
   - Some processors (e.g., `kg-store`) don't expose Cassandra settings as command-line arguments
   - Other processors expose them with different names and formats
   - Help text doesn't reflect environment variable defaults

Both parameter sets connect to the same Cassandra cluster but with different naming conventions, causing:

- Configuration confusion for users
- Increased maintenance burden
- Inconsistent documentation
- Potential for misconfiguration
- Inability to override settings via the command line in some processors

## Proposed Solution

### 1. Standardize Parameter Names

All modules will use consistent `cassandra_*` parameter names:

- `cassandra_host` - List of hosts (internally stored as a list)
- `cassandra_username` - Username for authentication
- `cassandra_password` - Password for authentication

### 2. Command-Line Arguments

All processors MUST expose Cassandra configuration via command-line arguments:

- `--cassandra-host` - Comma-separated list of hosts
- `--cassandra-username` - Username for authentication
- `--cassandra-password` - Password for authentication

### 3. Environment Variable Fallback

If command-line parameters are not explicitly provided, the system will check environment variables:

- `CASSANDRA_HOST` - Comma-separated list of hosts
- `CASSANDRA_USERNAME` - Username for authentication
- `CASSANDRA_PASSWORD` - Password for authentication

### 4. Default Values

If neither command-line parameters nor environment variables are specified:

- `cassandra_host` defaults to `["cassandra"]`
- `cassandra_username` defaults to `None` (no authentication)
- `cassandra_password` defaults to `None` (no authentication)

### 5. Help Text Requirements

The `--help` output must:

- Show environment variable values as defaults when set
- Never display password values (show `****` or `<set>` instead)
- Clearly indicate the resolution order in help text

Example help output:

```
--cassandra-host HOST
    Cassandra host list, comma-separated (default: prod-cluster-1,prod-cluster-2)
    [from CASSANDRA_HOST environment variable]

--cassandra-username USERNAME
    Cassandra username (default: cassandra_user)
    [from CASSANDRA_USERNAME environment variable]

--cassandra-password PASSWORD
    Cassandra password (default: <set from environment>)
```
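The never-display-passwords rule can be isolated in a tiny helper (a hedged sketch; the function name is illustrative, not part of the codebase):

```python
def mask_secret(value):
    """Render a secret's status for help text without echoing its value."""
    return "<set>" if value else "<not set>"
```

Help strings would then embed `mask_secret(password)` instead of the raw value.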

## Implementation Details

### Parameter Resolution Order

For each Cassandra parameter, the resolution order will be:

1. Command-line argument value
2. Environment variable (`CASSANDRA_*`)
3. Default value
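
As a sketch, the three-step order can be captured in one small helper (the function name is illustrative, not part of the codebase):

```python
import os

def resolve_param(cli_value, env_var, default):
    """Resolve a parameter: CLI argument, then environment variable, then default."""
    if cli_value is not None:
        return cli_value                    # 1. command-line argument
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return env_value                    # 2. environment variable (CASSANDRA_*)
    return default                          # 3. built-in default
```

Each of the three Cassandra parameters would be resolved independently through the same path.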

### Host Parameter Handling

The `cassandra_host` parameter:

- Command-line accepts a comma-separated string: `--cassandra-host "host1,host2,host3"`
- Environment variable accepts a comma-separated string: `CASSANDRA_HOST="host1,host2,host3"`
- Internally always stored as a list: `["host1", "host2", "host3"]`
- Single host: `"localhost"` → converted to `["localhost"]`
- Already a list: `["host1", "host2"]` → used as-is
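
A minimal normalization helper consistent with these rules (the name is illustrative):

```python
def normalize_hosts(value):
    """Normalize a host specification (comma-separated string or list) to a list."""
    if isinstance(value, list):
        return value  # already a list: use as-is
    # a single host is just the one-element case of a comma-separated string
    return [h.strip() for h in value.split(",") if h.strip()]
```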

### Authentication Logic

Authentication will be used when both `cassandra_username` and `cassandra_password` are provided:

```python
if cassandra_username and cassandra_password:
    # Use SSL context and PlainTextAuthProvider
    # (ssl_context is assumed to be configured elsewhere)
    auth = PlainTextAuthProvider(cassandra_username, cassandra_password)
    cluster = Cluster(cassandra_host, auth_provider=auth, ssl_context=ssl_context)
else:
    # Connect without authentication
    cluster = Cluster(cassandra_host)
```

## Files to Modify

### Modules using `graph_*` parameters (to be changed):

- `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/rows/cassandra/write.py`
- `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`

### Modules using `cassandra_*` parameters (to be updated with env fallback):

- `trustgraph-flow/trustgraph/tables/config.py`
- `trustgraph-flow/trustgraph/tables/knowledge.py`
- `trustgraph-flow/trustgraph/tables/library.py`
- `trustgraph-flow/trustgraph/storage/knowledge/store.py`
- `trustgraph-flow/trustgraph/cores/knowledge.py`
- `trustgraph-flow/trustgraph/librarian/librarian.py`
- `trustgraph-flow/trustgraph/librarian/service.py`
- `trustgraph-flow/trustgraph/config/service/service.py`
- `trustgraph-flow/trustgraph/cores/service.py`

### Test Files to Update:

- `tests/unit/test_cores/test_knowledge_manager.py`
- `tests/unit/test_storage/test_triples_cassandra_storage.py`
- `tests/unit/test_query/test_triples_cassandra_query.py`
- `tests/integration/test_objects_cassandra_integration.py`

## Implementation Strategy

### Phase 1: Create Common Configuration Helper

Create utility functions to standardize Cassandra configuration across all processors:

```python
import argparse
import os


def get_cassandra_defaults():
    """Get default values from environment variables, or fall back to built-ins."""
    return {
        'host': os.getenv('CASSANDRA_HOST', 'cassandra'),
        'username': os.getenv('CASSANDRA_USERNAME'),
        'password': os.getenv('CASSANDRA_PASSWORD')
    }


def add_cassandra_args(parser: argparse.ArgumentParser):
    """
    Add standardized Cassandra arguments to an argument parser.

    Shows environment variable values in help text.
    """
    defaults = get_cassandra_defaults()

    # Format help text with env var indication
    host_help = f"Cassandra host list, comma-separated (default: {defaults['host']})"
    if 'CASSANDRA_HOST' in os.environ:
        host_help += " [from CASSANDRA_HOST]"

    username_help = "Cassandra username"
    if defaults['username']:
        username_help += f" (default: {defaults['username']})"
    if 'CASSANDRA_USERNAME' in os.environ:
        username_help += " [from CASSANDRA_USERNAME]"

    password_help = "Cassandra password"
    if defaults['password']:
        password_help += " (default: <set>)"  # never echo the actual value
    if 'CASSANDRA_PASSWORD' in os.environ:
        password_help += " [from CASSANDRA_PASSWORD]"

    parser.add_argument(
        '--cassandra-host',
        default=defaults['host'],
        help=host_help
    )

    parser.add_argument(
        '--cassandra-username',
        default=defaults['username'],
        help=username_help
    )

    parser.add_argument(
        '--cassandra-password',
        default=defaults['password'],
        help=password_help
    )


def resolve_cassandra_config(args) -> tuple[list[str], str | None, str | None]:
    """
    Convert argparse args to Cassandra configuration.

    Returns:
        tuple: (hosts_list, username, password)
    """
    # Convert host string to list
    if isinstance(args.cassandra_host, str):
        hosts = [h.strip() for h in args.cassandra_host.split(',')]
    else:
        hosts = args.cassandra_host

    return hosts, args.cassandra_username, args.cassandra_password
```

### Phase 2: Update Modules Using `graph_*` Parameters

1. Change parameter names from `graph_*` to `cassandra_*`
2. Replace custom `add_args()` methods with the standardized `add_cassandra_args()`
3. Use the common configuration helper functions
4. Update documentation strings

Example transformation:

```python
# OLD CODE
@staticmethod
def add_args(parser):
    parser.add_argument(
        '-g', '--graph-host',
        default="localhost",
        help='Graph host (default: localhost)'
    )
    parser.add_argument(
        '--graph-username',
        default=None,
        help='Cassandra username'
    )

# NEW CODE
@staticmethod
def add_args(parser):
    FlowProcessor.add_args(parser)
    add_cassandra_args(parser)  # Use standard helper
```

### Phase 3: Update Modules Using `cassandra_*` Parameters

1. Add command-line argument support where missing (e.g., `kg-store`)
2. Replace existing argument definitions with `add_cassandra_args()`
3. Use `resolve_cassandra_config()` for consistent resolution
4. Ensure consistent host list handling

### Phase 4: Update Tests and Documentation

1. Update all test files to use the new parameter names
2. Update CLI documentation
3. Update API documentation
4. Add environment variable documentation

## Backward Compatibility

To maintain backward compatibility during the transition:

1. **Deprecation warnings** for `graph_*` parameters
2. **Parameter aliasing** - accept both old and new names initially
3. **Phased rollout** over multiple releases
4. **Documentation updates** with a migration guide

Example backward compatibility code:

```python
import warnings


def __init__(self, **params):
    # Handle deprecated graph_* parameters
    if 'graph_host' in params:
        warnings.warn("graph_host is deprecated, use cassandra_host",
                      DeprecationWarning)
        params.setdefault('cassandra_host', params.pop('graph_host'))

    if 'graph_username' in params:
        warnings.warn("graph_username is deprecated, use cassandra_username",
                      DeprecationWarning)
        params.setdefault('cassandra_username', params.pop('graph_username'))

    # ... continue with standard resolution
```

## Testing Strategy

1. **Unit tests** for configuration resolution logic
2. **Integration tests** with various configuration combinations
3. **Environment variable tests**
4. **Backward compatibility tests** with deprecated parameters
5. **Docker compose tests** with environment variables

## Documentation Updates

1. Update all CLI command documentation
2. Update API documentation
3. Create a migration guide
4. Update Docker compose examples
5. Update the configuration reference documentation

## Risks and Mitigation

| Risk | Impact | Mitigation |
|------|--------|------------|
| Breaking changes for users | High | Implement a backward compatibility period |
| Configuration confusion during transition | Medium | Clear documentation and deprecation warnings |
| Test failures | Medium | Comprehensive test updates |
| Docker deployment issues | High | Update all Docker compose examples |

## Success Criteria

- [ ] All modules use consistent `cassandra_*` parameter names
- [ ] All processors expose Cassandra settings via command-line arguments
- [ ] Command-line help text shows environment variable defaults
- [ ] Password values are never displayed in help text
- [ ] Environment variable fallback works correctly
- [ ] `cassandra_host` is consistently handled as a list internally
- [ ] Backward compatibility maintained for at least 2 releases
- [ ] All tests pass with the new configuration system
- [ ] Documentation fully updated
- [ ] Docker compose examples work with environment variables

## Timeline

- **Week 1:** Implement the common configuration helper and update `graph_*` modules
- **Week 2:** Add environment variable support to existing `cassandra_*` modules
- **Week 3:** Update tests and documentation
- **Week 4:** Integration testing and bug fixes

## Future Considerations

- Consider extending this pattern to other database configurations (e.g., Elasticsearch)
- Implement configuration validation and better error messages
- Add support for Cassandra connection pooling configuration
- Consider adding configuration file support (`.env` files)
docs/tech-specs/cassandra-performance-refactor.md (new file, 582 lines)
# Tech Spec: Cassandra Knowledge Base Performance Refactor

**Status:** Draft
**Author:** Assistant
**Date:** 2025-09-18

## Overview

This specification addresses performance issues in the TrustGraph Cassandra knowledge base implementation and proposes optimizations for RDF triple storage and querying.

## Current Implementation

### Schema Design

The current implementation uses a single-table design in `trustgraph-flow/trustgraph/direct/cassandra_kg.py`:

```sql
CREATE TABLE triples (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY (collection, s, p, o)
);
```

**Secondary Indexes:**

- `triples_s` ON `s` (subject)
- `triples_p` ON `p` (predicate)
- `triples_o` ON `o` (object)

### Query Patterns

The current implementation supports 8 distinct query patterns:

1. **get_all(collection, limit=50)** - Retrieve all triples for a collection
   ```sql
   SELECT s, p, o FROM triples WHERE collection = ? LIMIT 50
   ```

2. **get_s(collection, s, limit=10)** - Query by subject
   ```sql
   SELECT p, o FROM triples WHERE collection = ? AND s = ? LIMIT 10
   ```

3. **get_p(collection, p, limit=10)** - Query by predicate
   ```sql
   SELECT s, o FROM triples WHERE collection = ? AND p = ? LIMIT 10
   ```

4. **get_o(collection, o, limit=10)** - Query by object
   ```sql
   SELECT s, p FROM triples WHERE collection = ? AND o = ? LIMIT 10
   ```

5. **get_sp(collection, s, p, limit=10)** - Query by subject + predicate
   ```sql
   SELECT o FROM triples WHERE collection = ? AND s = ? AND p = ? LIMIT 10
   ```

6. **get_po(collection, p, o, limit=10)** - Query by predicate + object ⚠️
   ```sql
   SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
   ```

7. **get_os(collection, o, s, limit=10)** - Query by object + subject ⚠️
   ```sql
   SELECT p FROM triples WHERE collection = ? AND o = ? AND s = ? LIMIT 10 ALLOW FILTERING
   ```

8. **get_spo(collection, s, p, o, limit=10)** - Exact triple match
   ```sql
   SELECT s AS x FROM triples WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT 10
   ```
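
The eight accessors cover every combination of bound and unbound triple positions. As an illustrative sketch (not code from the repo), the accessor for a pattern can be picked from which positions are bound:

```python
def pick_accessor(s=None, p=None, o=None):
    """Map bound triple positions to the accessor that serves the pattern."""
    return {
        (False, False, False): "get_all",
        (True,  False, False): "get_s",
        (False, True,  False): "get_p",
        (False, False, True):  "get_o",
        (True,  True,  False): "get_sp",
        (False, True,  True):  "get_po",
        (True,  False, True):  "get_os",
        (True,  True,  True):  "get_spo",
    }[(s is not None, p is not None, o is not None)]
```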

### Current Architecture

**File: `trustgraph-flow/trustgraph/direct/cassandra_kg.py`**

- Single `KnowledgeGraph` class handling all operations
- Connection pooling through a global `_active_clusters` list
- Fixed table name: `"triples"`
- Keyspace-per-user model
- SimpleStrategy replication with a factor of 1

**Integration Points:**

- **Write Path:** `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- **Query Path:** `trustgraph-flow/trustgraph/query/triples/cassandra/service.py`
- **Knowledge Store:** `trustgraph-flow/trustgraph/tables/knowledge.py`

## Performance Issues Identified

### Schema-Level Issues

1. **Inefficient Primary Key Design**
   - Current: `PRIMARY KEY (collection, s, p, o)`
   - Results in poor clustering for common access patterns
   - Forces expensive secondary index usage

2. **Secondary Index Overuse** ⚠️
   - Three secondary indexes on high-cardinality columns (s, p, o)
   - Secondary indexes in Cassandra are expensive and don't scale well
   - Queries 6 & 7 require `ALLOW FILTERING`, indicating poor data modeling

3. **Hot Partition Risk**
   - The single partition key `collection` can create hot partitions
   - Large collections will concentrate on single nodes
   - No distribution strategy for load balancing

### Query-Level Issues

1. **ALLOW FILTERING Usage** ⚠️
   - Two query types (get_po, get_os) require `ALLOW FILTERING`
   - These queries scan multiple partitions and are extremely expensive
   - Performance degrades linearly with data size

2. **Inefficient Access Patterns**
   - No optimization for common RDF query patterns
   - Missing compound indexes for frequent query combinations
   - No consideration for graph traversal patterns

3. **Lack of Query Optimization**
   - No prepared-statement caching
   - No query hints or optimization strategies
   - No pagination beyond a simple LIMIT

## Problem Statement

The current Cassandra knowledge base implementation has two critical performance bottlenecks:

### 1. Inefficient get_po Query Performance

The `get_po(collection, p, o)` query is extremely inefficient because it requires `ALLOW FILTERING`:

```sql
SELECT s FROM triples WHERE collection = ? AND p = ? AND o = ? LIMIT 10 ALLOW FILTERING
```

**Why this is problematic:**

- `ALLOW FILTERING` forces Cassandra to scan all partitions within the collection
- Performance degrades linearly with data size
- This is a common RDF query pattern (finding subjects that have a specific predicate-object relationship)
- It creates significant load on the cluster as data grows

### 2. Poor Clustering Strategy

The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clustering benefits:

**Issues with current clustering:**

- `collection` as the partition key doesn't distribute data effectively
- Most collections contain diverse data, making clustering ineffective
- No consideration for common access patterns in RDF queries
- Large collections create hot partitions on single nodes
- The clustering columns (s, p, o) don't optimize for typical graph traversal patterns

**Impact:**

- Queries don't benefit from data locality
- Poor cache utilization
- Uneven load distribution across cluster nodes
- Scalability bottlenecks as collections grow

## Proposed Solution: Multi-Table Denormalization Strategy

### Overview

Replace the single `triples` table with three purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and `ALLOW FILTERING` while providing optimal performance for all query types.

### New Schema Design

**Table 1: Subject-Centric Queries**

```sql
CREATE TABLE triples_by_subject (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY ((collection, s), p, o)
);
```

- **Optimizes:** get_s, get_sp, get_spo, get_os
- **Partition Key:** (collection, s) - Better distribution than collection alone
- **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject

**Table 2: Predicate-Object Queries**

```sql
CREATE TABLE triples_by_po (
    collection text,
    p text,
    o text,
    s text,
    PRIMARY KEY ((collection, p), o, s)
);
```

- **Optimizes:** get_p, get_po (eliminates ALLOW FILTERING!)
- **Partition Key:** (collection, p) - Direct access by predicate
- **Clustering:** (o, s) - Efficient object-subject traversal

**Table 3: Object-Centric Queries**

```sql
CREATE TABLE triples_by_object (
    collection text,
    o text,
    s text,
    p text,
    PRIMARY KEY ((collection, o), s, p)
);
```

- **Optimizes:** get_o, get_os
- **Partition Key:** (collection, o) - Direct access by object
- **Clustering:** (s, p) - Efficient subject-predicate traversal

### Query Mapping

| Original Query | Target Table | Performance Improvement |
|----------------|--------------|-------------------------|
| get_all(collection) | triples_by_subject | Token-based pagination |
| get_s(collection, s) | triples_by_subject | Direct partition access |
| get_p(collection, p) | triples_by_po | Direct partition access |
| get_o(collection, o) | triples_by_object | Direct partition access |
| get_sp(collection, s, p) | triples_by_subject | Partition + clustering |
| get_po(collection, p, o) | triples_by_po | **No more ALLOW FILTERING!** |
| get_os(collection, o, s) | triples_by_subject | Partition + clustering |
| get_spo(collection, s, p, o) | triples_by_subject | Exact key lookup |

### Benefits

1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path
2. **No Secondary Indexes** - Each table IS the index for its query pattern
3. **Better Data Distribution** - Composite partition keys spread load effectively
4. **Predictable Performance** - Query time is proportional to result size, not total data size
5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture

## Implementation Plan

### Files Requiring Changes

#### Primary Implementation File

**`trustgraph-flow/trustgraph/direct/cassandra_kg.py`** - Complete rewrite required

**Current Methods to Refactor:**

```python
# Schema initialization
def init(self) -> None  # Replace single table with three tables

# Insert operations
def insert(self, collection, s, p, o) -> None  # Write to all three tables

# Query operations (API unchanged, implementation optimized)
def get_all(self, collection, limit=50)           # Use triples_by_subject
def get_s(self, collection, s, limit=10)          # Use triples_by_subject
def get_p(self, collection, p, limit=10)          # Use triples_by_po
def get_o(self, collection, o, limit=10)          # Use triples_by_object
def get_sp(self, collection, s, p, limit=10)      # Use triples_by_subject
def get_po(self, collection, p, o, limit=10)      # Use triples_by_po (NO ALLOW FILTERING!)
def get_os(self, collection, o, s, limit=10)      # Use triples_by_subject
def get_spo(self, collection, s, p, o, limit=10)  # Use triples_by_subject

# Collection management
def delete_collection(self, collection) -> None  # Delete from all three tables
```

#### Integration Files (No Logic Changes Required)

**`trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`**

- No changes needed - uses the existing KnowledgeGraph API
- Benefits automatically from the performance improvements

**`trustgraph-flow/trustgraph/query/triples/cassandra/service.py`**

- No changes needed - uses the existing KnowledgeGraph API
- Benefits automatically from the performance improvements

### Test Files Requiring Updates

#### Unit Tests

**`tests/unit/test_storage/test_triples_cassandra_storage.py`**

- Update test expectations for the schema changes
- Add tests for multi-table consistency
- Verify no ALLOW FILTERING in query plans

**`tests/unit/test_query/test_triples_cassandra_query.py`**

- Update performance assertions
- Test all 8 query patterns against the new tables
- Verify query routing to the correct tables

#### Integration Tests

**`tests/integration/test_cassandra_integration.py`**

- End-to-end testing with the new schema
- Performance benchmarking comparisons
- Data consistency verification across tables

**`tests/unit/test_storage/test_cassandra_config_integration.py`**

- Update schema validation tests
- Test migration scenarios

### Implementation Strategy

#### Phase 1: Schema and Core Methods

1. **Rewrite the `init()` method** - Create three tables instead of one
2. **Rewrite the `insert()` method** - Batch writes to all three tables
3. **Implement prepared statements** - For optimal performance
4. **Add table routing logic** - Direct queries to the optimal tables

#### Phase 2: Query Method Optimization

1. **Rewrite each get_* method** to use its optimal table
2. **Remove all ALLOW FILTERING** usage
3. **Implement efficient clustering key usage**
4. **Add query performance logging**

#### Phase 3: Collection Management

1. **Update `delete_collection()`** - Remove from all three tables
2. **Add consistency verification** - Ensure all tables stay in sync
3. **Implement batch operations** - For atomic multi-table operations

### Key Implementation Details

#### Batch Write Strategy

```python
def insert(self, collection, s, p, o):
    batch = BatchStatement()

    # Insert into all three tables; simple statements in the Python driver
    # use %s placeholders (prepared statements, shown below, use ?)
    batch.add(SimpleStatement(
        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (%s, %s, %s, %s)"
    ), (collection, s, p, o))

    batch.add(SimpleStatement(
        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (%s, %s, %s, %s)"
    ), (collection, p, o, s))

    batch.add(SimpleStatement(
        "INSERT INTO triples_by_object (collection, o, s, p) VALUES (%s, %s, %s, %s)"
    ), (collection, o, s, p))

    self.session.execute(batch)
```

#### Query Routing Logic

```python
def get_po(self, collection, p, o, limit=10):
    # Route to the triples_by_po table - NO ALLOW FILTERING!
    # (%s placeholders, as required for simple statements in the Python driver)
    return self.session.execute(
        "SELECT s FROM triples_by_po WHERE collection = %s AND p = %s AND o = %s LIMIT %s",
        (collection, p, o, limit)
    )
```

#### Prepared Statement Optimization

```python
def prepare_statements(self):
    # Cache prepared statements for better performance
    self.insert_subject_stmt = self.session.prepare(
        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
    )
    self.insert_po_stmt = self.session.prepare(
        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
    )
    # ... etc. for all tables and queries
```

## Migration Strategy

### Data Migration Approach

#### Option 1: Blue-Green Deployment (Recommended)

1. **Deploy the new schema alongside the existing one** - Use different table names temporarily
2. **Dual-write period** - Write to both old and new schemas during the transition
3. **Background migration** - Copy existing data to the new tables
4. **Switch reads** - Route queries to the new tables once data is migrated
5. **Drop old tables** - After a verification period
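
The dual-write step can be sketched as a thin wrapper over the two store instances (class and attribute names here are illustrative, not from the repo):

```python
class DualWriteKG:
    """Write to both legacy and optimized stores; read from one of them."""

    def __init__(self, legacy, optimized, reads_from_new=False):
        self.legacy = legacy
        self.optimized = optimized
        self.reads_from_new = reads_from_new  # flip once migration is verified

    def insert(self, collection, s, p, o):
        # Dual-write: keep both schemas in sync during the transition
        self.legacy.insert(collection, s, p, o)
        self.optimized.insert(collection, s, p, o)

    def get_s(self, collection, s, limit=10):
        store = self.optimized if self.reads_from_new else self.legacy
        return store.get_s(collection, s, limit=limit)
```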

#### Option 2: In-Place Migration

1. **Schema addition** - Create the new tables in the existing keyspace
2. **Data migration script** - Batch-copy from the old table to the new tables
3. **Application update** - Deploy new code after the migration completes
4. **Old table cleanup** - Remove the old table and its indexes

### Backward Compatibility

#### Deployment Strategy

```python
# Environment variable to control table usage during migration
USE_LEGACY_TABLES = os.getenv('CASSANDRA_USE_LEGACY', 'false').lower() == 'true'

class KnowledgeGraph:
    def __init__(self, ...):
        if USE_LEGACY_TABLES:
            self.init_legacy_schema()
        else:
            self.init_optimized_schema()
```

#### Migration Script

```python
def migrate_data():
    # Read from the old table
    old_triples = session.execute("SELECT collection, s, p, o FROM triples")

    # Batch-write to the new tables; batched() groups rows into chunks of 100
    # (itertools.batched in Python 3.12+)
    for batch in batched(old_triples, 100):
        batch_stmt = BatchStatement()
        for row in batch:
            # Add to all three new tables, reordering columns per table
            batch_stmt.add(insert_subject_stmt, (row.collection, row.s, row.p, row.o))
            batch_stmt.add(insert_po_stmt, (row.collection, row.p, row.o, row.s))
            batch_stmt.add(insert_object_stmt, (row.collection, row.o, row.s, row.p))
        session.execute(batch_stmt)
```
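
`itertools.batched` is only available from Python 3.12; on older interpreters an equivalent helper (yielding lists rather than tuples) can be used:

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of up to n items from iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk
```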

### Validation Strategy

#### Data Consistency Checks

```python
def validate_migration():
    # Count records in the old vs. new tables for each collection
    old_count = session.execute(
        "SELECT COUNT(*) FROM triples WHERE collection = %s", (collection,)
    ).one()[0]
    new_count = session.execute(
        "SELECT COUNT(*) FROM triples_by_subject WHERE collection = %s", (collection,)
    ).one()[0]

    assert old_count == new_count, f"Record count mismatch: {old_count} vs {new_count}"

    # Spot-check random samples
    sample_queries = generate_test_queries()
    for query in sample_queries:
        old_result = execute_legacy_query(query)
        new_result = execute_optimized_query(query)
        assert old_result == new_result, f"Query results differ for {query}"
```

## Testing Strategy

### Performance Testing

#### Benchmark Scenarios

1. **Query Performance Comparison**
   - Before/after performance metrics for all 8 query types
   - Focus on the get_po improvement (eliminating ALLOW FILTERING)
   - Measure query latency at various data sizes

2. **Load Testing**
   - Concurrent query execution
   - Write throughput with batch operations
   - Memory and CPU utilization

3. **Scalability Testing**
   - Performance with increasing collection sizes
   - Multi-collection query distribution
   - Cluster node utilization

#### Test Data Sets

- **Small:** 10K triples per collection
- **Medium:** 100K triples per collection
- **Large:** 1M+ triples per collection
- **Multiple collections:** Test partition distribution
|
||||
|
||||
### Functional Testing

#### Unit Test Updates

```python
# Example test structure for the new implementation
class TestCassandraKGPerformance:

    def test_get_po_no_allow_filtering(self):
        # Verify get_po queries don't use ALLOW FILTERING
        with patch('cassandra.cluster.Session.execute') as mock_execute:
            kg.get_po('test_collection', 'predicate', 'object')
            executed_query = mock_execute.call_args[0][0]
            assert 'ALLOW FILTERING' not in executed_query

    def test_multi_table_consistency(self):
        # Verify all tables stay in sync
        kg.insert('test', 's1', 'p1', 'o1')

        # Check that every table contains the triple
        assert_triple_exists('triples_by_subject', 'test', 's1', 'p1', 'o1')
        assert_triple_exists('triples_by_po', 'test', 'p1', 'o1', 's1')
        assert_triple_exists('triples_by_object', 'test', 'o1', 's1', 'p1')
```

#### Integration Test Updates

```python
class TestCassandraIntegration:

    def test_query_performance_regression(self):
        # Ensure the new implementation is faster than the old one
        old_time = benchmark_legacy_get_po()
        new_time = benchmark_optimized_get_po()
        assert new_time < old_time * 0.5  # At least 50% improvement

    def test_end_to_end_workflow(self):
        # Test the complete write -> query -> delete cycle
        # Verify no performance degradation in integration
        ...
```

### Rollback Plan

#### Quick Rollback Strategy

1. **Environment variable toggle** - Switch back to legacy tables immediately
2. **Keep legacy tables** - Don't drop them until performance is proven
3. **Monitoring alerts** - Automated rollback triggers based on error rates/latency

#### Rollback Validation

```python
def rollback_to_legacy():
    # Set the environment variable toggle
    os.environ['CASSANDRA_USE_LEGACY'] = 'true'

    # Restart services to pick up the change
    restart_cassandra_services()

    # Validate functionality
    run_smoke_tests()
```

## Risks and Considerations

### Performance Risks

- **Write latency increase** - 3x write operations per insert
- **Storage overhead** - 3x storage requirement
- **Batch write failures** - Need proper error handling

### Operational Risks

- **Migration complexity** - Data migration for large datasets
- **Consistency challenges** - Ensuring all tables stay synchronized
- **Monitoring gaps** - Need new metrics for multi-table operations

### Mitigation Strategies

1. **Gradual rollout** - Start with small collections
2. **Comprehensive monitoring** - Track all performance metrics
3. **Automated validation** - Continuous consistency checking
4. **Quick rollback capability** - Environment-based table selection

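The environment-based table selection in mitigation 4 could be sketched as below. The `CASSANDRA_USE_LEGACY` variable matches the rollback plan; the routing map from query pattern to table is an assumption about how reads would be dispatched, not the shipped implementation:

```python
import os

# Hypothetical routing: each read pattern targets the table keyed for it.
READ_ROUTING = {
    "get_s": "triples_by_subject",
    "get_sp": "triples_by_subject",
    "get_spo": "triples_by_subject",
    "get_po": "triples_by_po",
    "get_o": "triples_by_object",
    "get_os": "triples_by_object",
}

def select_read_table(query_kind: str) -> str:
    """Pick the table a read should target under the current rollback toggle."""
    if os.environ.get("CASSANDRA_USE_LEGACY", "false").lower() == "true":
        return "triples"  # legacy single-table schema
    return READ_ROUTING.get(query_kind, "triples_by_subject")
```

Because the toggle is read per call, flipping the environment variable and restarting a service is enough to route all reads back to the legacy table.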
## Success Criteria

### Performance Improvements

- [ ] **Eliminate ALLOW FILTERING** - get_po and get_os queries run without filtering
- [ ] **Query latency reduction** - 50%+ improvement in query response times
- [ ] **Better load distribution** - No hot partitions, even load across cluster nodes
- [ ] **Scalable performance** - Query time proportional to result size, not total data

### Functional Requirements

- [ ] **API compatibility** - All existing code continues to work unchanged
- [ ] **Data consistency** - All three tables remain synchronized
- [ ] **Zero data loss** - Migration preserves all existing triples
- [ ] **Backward compatibility** - Ability to roll back to the legacy schema

### Operational Requirements

- [ ] **Safe migration** - Blue-green deployment with rollback capability
- [ ] **Monitoring coverage** - Comprehensive metrics for multi-table operations
- [ ] **Test coverage** - All query patterns tested with performance benchmarks
- [ ] **Documentation** - Updated deployment and operational procedures

## Timeline

### Phase 1: Implementation

- [ ] Rewrite `cassandra_kg.py` with multi-table schema
- [ ] Implement batch write operations
- [ ] Add prepared statement optimization
- [ ] Update unit tests

### Phase 2: Integration Testing

- [ ] Update integration tests
- [ ] Performance benchmarking
- [ ] Load testing with realistic data volumes
- [ ] Validation scripts for data consistency

### Phase 3: Migration Planning

- [ ] Blue-green deployment scripts
- [ ] Data migration tools
- [ ] Monitoring dashboard updates
- [ ] Rollback procedures

### Phase 4: Production Deployment

- [ ] Staged rollout to production
- [ ] Performance monitoring and validation
- [ ] Legacy table cleanup
- [ ] Documentation updates

## Conclusion

This multi-table denormalization strategy directly addresses the two critical performance bottlenecks:

1. **Eliminates expensive ALLOW FILTERING** by providing optimal table structures for each query pattern
2. **Improves clustering effectiveness** through composite partition keys that distribute load properly

The approach leverages Cassandra's strengths while maintaining complete API compatibility, ensuring existing code benefits automatically from the performance improvements.

349
docs/tech-specs/collection-management.md
Normal file
# Collection Management Technical Specification

## Overview

This specification describes the collection management capabilities for TrustGraph, enabling users to have explicit control over collections that are currently implicitly created during data loading and querying operations. The feature supports four primary use cases:

1. **Collection Listing**: View all existing collections in the system
2. **Collection Deletion**: Remove unwanted collections and their associated data
3. **Collection Labeling**: Associate descriptive labels with collections for better organization
4. **Collection Tagging**: Apply tags to collections for categorization and easier discovery

## Goals

- **Explicit Collection Control**: Provide users with direct management capabilities over collections beyond implicit creation
- **Collection Visibility**: Enable users to list and inspect all collections in their environment
- **Collection Cleanup**: Allow deletion of collections that are no longer needed
- **Collection Organization**: Support labels and tags for better collection tracking and discovery
- **Metadata Management**: Associate meaningful metadata with collections for operational clarity
- **Collection Discovery**: Make it easier to find specific collections through filtering and search
- **Operational Transparency**: Provide clear visibility into collection lifecycle and usage
- **Resource Management**: Enable cleanup of unused collections to optimize resource utilization

## Background

Currently, collections in TrustGraph are implicitly created during data loading operations and query execution. While this provides convenience for users, it lacks the explicit control needed for production environments and long-term data management.

Current limitations include:
- No way to list existing collections
- No mechanism to delete unwanted collections
- No ability to associate metadata with collections for tracking purposes
- Difficulty in organizing and discovering collections over time

This specification addresses these gaps by introducing explicit collection management operations. By providing collection management APIs and commands, TrustGraph can:
- Give users full control over their collection lifecycle
- Enable better organization through labels and tags
- Support collection cleanup for resource optimization
- Improve operational visibility and management

## Technical Design

### Architecture

The collection management system will be implemented within existing TrustGraph infrastructure:

1. **Librarian Service Integration**
   - Collection management operations will be added to the existing librarian service
   - No new service required - leverages existing authentication and access patterns
   - Handles collection listing, deletion, and metadata management

   Module: trustgraph-librarian

2. **Cassandra Collection Metadata Table**
   - New table in the existing librarian keyspace
   - Stores collection metadata with user-scoped access
   - Primary key: (user_id, collection_id) for proper multi-tenancy

   Module: trustgraph-librarian

3. **Collection Management CLI**
   - Command-line interface for collection operations
   - Provides list, delete, label, and tag management commands
   - Integrates with existing CLI framework

   Module: trustgraph-cli

### Data Models

#### Cassandra Collection Metadata Table

The collection metadata will be stored in a structured Cassandra table in the librarian keyspace:

```sql
CREATE TABLE collections (
    user text,
    collection text,
    name text,
    description text,
    tags set<text>,
    created_at timestamp,
    updated_at timestamp,
    PRIMARY KEY (user, collection)
);
```

Table structure:
- **user** + **collection**: Composite primary key ensuring user isolation
- **name**: Human-readable collection name
- **description**: Detailed description of collection purpose
- **tags**: Set of tags for categorization and filtering
- **created_at**: Collection creation timestamp
- **updated_at**: Last modification timestamp

This approach allows:
- Multi-tenant collection management with user isolation
- Efficient querying by user and collection
- Flexible tagging system for organization
- Lifecycle tracking for operational insights

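Because the table partitions by user and a `set<text>` column cannot be filtered server-side without extra indexing, one plausible approach is to read the user's partition (`SELECT * FROM collections WHERE user = %s`) and apply tag filtering in the service. A sketch of that client-side filter, with the row shape an assumption:

```python
def filter_by_tags(rows, required_tags=None):
    """Keep only collection records carrying every required tag.

    `rows` are decoded rows from one user's partition; `tags` mirrors
    the set<text> column and may be missing or None for untagged
    collections. With no required tags, everything passes.
    """
    required = set(required_tags or ())
    return [r for r in rows if required <= set(r.get("tags") or ())]
```

For example, `filter_by_tags(rows, ["2024"])` keeps only records whose tag set contains `2024`, leaving untagged collections out.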
#### Collection Lifecycle

Collections follow a lazy-creation pattern that aligns with existing TrustGraph behavior:

1. **Lazy Creation**: Collections are automatically created when first referenced during data loading or query operations. No explicit create operation is needed.

2. **Implicit Registration**: When a collection is used (data loading, querying), the system checks whether a metadata record exists. If not, a new record is created with default values:
   - `name`: defaults to collection_id
   - `description`: empty
   - `tags`: empty set
   - `created_at`: current timestamp

3. **Explicit Updates**: Users can update collection metadata (name, description, tags) through management operations after lazy creation.

4. **Explicit Deletion**: Users can delete collections, which removes both the metadata record and the underlying collection data across all store types.

5. **Multi-Store Deletion**: Collection deletion cascades across all storage backends (vector stores, object stores, triple stores), as each implements lazy creation and must support collection deletion.

Operations required:
- **Collection Use Notification**: Internal operation triggered during data loading/querying to ensure a metadata record exists
- **Update Collection Metadata**: User operation to modify name, description, and tags
- **Delete Collection**: User operation to remove a collection and its data across all stores
- **List Collections**: User operation to view collections with filtering by tags

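The use-notification above could be implemented as a single lightweight-transaction insert, which makes concurrent first-use notifications safe: only one writer creates the record, and later calls are no-ops. A sketch under that assumption (the statement text is illustrative, not the shipped implementation):

```python
from datetime import datetime, timezone

ENSURE_COLLECTION_CQL = """
INSERT INTO collections (user, collection, name, description, tags, created_at, updated_at)
VALUES (%s, %s, %s, %s, %s, %s, %s)
IF NOT EXISTS
"""

def ensure_collection(session, user: str, collection: str) -> None:
    """Idempotently register a collection on first use (lazy creation)."""
    now = datetime.now(timezone.utc)
    # Defaults per the lifecycle rules: name=collection_id, empty
    # description, empty tag set, created_at = current timestamp.
    session.execute(
        ENSURE_COLLECTION_CQL,
        (user, collection, collection, "", set(), now, now),
    )
```

The `IF NOT EXISTS` clause trades a lightweight transaction on first use for not having to read-before-write on every data-loading call.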
#### Multi-Store Collection Management

Collections exist across multiple storage backends in TrustGraph:
- **Vector Stores**: Store embeddings and vector data for collections
- **Object Stores**: Store documents and file data for collections
- **Triple Stores**: Store graph/RDF data for collections

Each store type implements:
- **Lazy Creation**: Collections are created implicitly when data is first stored
- **Collection Deletion**: Store-specific deletion operations to remove collection data

The librarian service coordinates collection operations across all store types, ensuring consistent collection lifecycle management.

### APIs

New APIs:
- **List Collections**: Retrieve collections for a user with optional tag filtering
- **Update Collection Metadata**: Modify collection name, description, and tags
- **Delete Collection**: Remove collection and associated data with confirmation, cascading to all store types
- **Collection Use Notification** (internal): Ensure metadata record exists when collection is referenced

Store Writer APIs (enhanced):
- **Vector Store Collection Deletion**: Remove vector data for specified user and collection
- **Object Store Collection Deletion**: Remove object/document data for specified user and collection
- **Triple Store Collection Deletion**: Remove graph/RDF data for specified user and collection

Modified APIs:
- **Data Loading APIs**: Enhanced to trigger collection use notification for lazy metadata creation
- **Query APIs**: Enhanced to trigger collection use notification and optionally include metadata in responses

### Implementation Details

The implementation will follow existing TrustGraph patterns for service integration and CLI command structure.

#### Collection Deletion Cascade

When a user initiates collection deletion through the librarian service:

1. **Metadata Validation**: Verify the collection exists and the user has permission to delete it
2. **Store Cascade**: The librarian coordinates deletion across all store writers:
   - Vector store writer: Remove embeddings and vector indexes for the user and collection
   - Object store writer: Remove documents and files for the user and collection
   - Triple store writer: Remove graph data and triples for the user and collection
3. **Metadata Cleanup**: Remove the collection metadata record from Cassandra
4. **Error Handling**: If any store deletion fails, maintain consistency through rollback or retry mechanisms

#### Collection Management Interface

All store writers will implement a standardized collection management interface with a common schema across store types:

**Message Schema:**
```json
{
    "operation": "delete-collection",
    "user": "user123",
    "collection": "documents-2024",
    "timestamp": "2024-01-15T10:30:00Z"
}
```

**Queue Architecture:**
- **Object Store Collection Management Queue**: Handles collection operations for object/document stores
- **Vector Store Collection Management Queue**: Handles collection operations for vector/embedding stores
- **Triple Store Collection Management Queue**: Handles collection operations for graph/RDF stores

Each store writer implements:
- **Collection Management Handler**: Separate from standard data storage handlers
- **Delete Collection Operation**: Removes all data associated with the specified collection
- **Message Processing**: Consumes from a dedicated collection management queue
- **Status Reporting**: Returns success/failure status for coordination
- **Idempotent Operations**: Handles cases where the collection doesn't exist (no-op)

**Initial Implementation:**
Only the `delete-collection` operation will be implemented initially. The interface supports future operations such as `archive-collection`, `migrate-collection`, etc.

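A store writer's collection-management handler might look like the following sketch. Here `store.delete_collection` stands in for whatever store-specific deletion each backend implements, and the reply shape is an assumption, not the defined schema:

```python
import json

def handle_collection_management(raw: bytes, store) -> dict:
    """Process one collection-management message and build a status reply.

    Unknown collections are expected to be a no-op inside
    store.delete_collection, keeping the operation idempotent.
    """
    msg = json.loads(raw)
    if msg.get("operation") != "delete-collection":
        return {"status": "error",
                "error": f"unsupported operation: {msg.get('operation')}"}
    store.delete_collection(msg["user"], msg["collection"])
    return {"status": "ok",
            "user": msg["user"],
            "collection": msg["collection"]}
```

Keeping the handler a pure message-to-reply function makes it easy to unit-test independently of the queue consumer that wraps it.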
#### Cassandra Triple Store Refactor

As part of this implementation, the Cassandra triple store will be refactored from a table-per-collection model to a unified table model:

**Current Architecture:**
- Keyspace per user, separate table per collection
- Schema: `(s, p, o)` with `PRIMARY KEY (s, p, o)`
- Table names: user collections become separate Cassandra tables

**New Architecture:**
- Keyspace per user, single "triples" table for all collections
- Schema: `(collection, s, p, o)` with `PRIMARY KEY (collection, s, p, o)`
- Collection isolation through collection partitioning

**Changes Required:**

1. **TrustGraph Class Refactor** (`trustgraph/direct/cassandra.py`):
   - Remove the `table` parameter from the constructor; use a fixed "triples" table
   - Add a `collection` parameter to all methods
   - Update the schema to include collection as the first column
   - **Index Updates**: New indexes will be created to support all 8 query patterns:
     - Index on `(s)` for subject-based queries
     - Index on `(p)` for predicate-based queries
     - Index on `(o)` for object-based queries
     - Note: Cassandra doesn't support multi-column secondary indexes, so these are single-column indexes

   - **Query Pattern Performance**:
     - ✅ `get_all()` - partition scan on `collection`
     - ✅ `get_s(s)` - uses the primary key efficiently (`collection, s`)
     - ✅ `get_p(p)` - uses `idx_p` with `collection` filtering
     - ✅ `get_o(o)` - uses `idx_o` with `collection` filtering
     - ✅ `get_sp(s, p)` - uses the primary key efficiently (`collection, s, p`)
     - ⚠️ `get_po(p, o)` - requires `ALLOW FILTERING` (uses either `idx_p` or `idx_o` plus filtering)
     - ✅ `get_os(o, s)` - uses `idx_o` with additional filtering on `s`
     - ✅ `get_spo(s, p, o)` - uses the full primary key efficiently

   - **Note on ALLOW FILTERING**: The `get_po` query pattern requires `ALLOW FILTERING` because it needs both predicate and object constraints without a suitable compound index. This is acceptable, as this query pattern is less common than subject-based queries in typical triple store usage.

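Under these assumptions, the refactored DDL and the one filtered query pattern would look roughly like this sketch (index names are illustrative):

```sql
CREATE TABLE triples (
    collection text,
    s text,
    p text,
    o text,
    PRIMARY KEY (collection, s, p, o)
);

-- Single-column secondary indexes (Cassandra has no multi-column ones)
CREATE INDEX idx_s ON triples (s);
CREATE INDEX idx_p ON triples (p);
CREATE INDEX idx_o ON triples (o);

-- get_po: both predicate and object constraints with no compound
-- index available, so ALLOW FILTERING is required
SELECT s FROM triples
WHERE collection = 'docs' AND p = 'hasAuthor' AND o = 'Alice'
ALLOW FILTERING;
```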
2. **Storage Writer Updates** (`trustgraph/storage/triples/cassandra/write.py`):
   - Maintain a single TrustGraph connection per user instead of per (user, collection)
   - Pass the collection to insert operations
   - Improved resource utilization with fewer connections

3. **Query Service Updates** (`trustgraph/query/triples/cassandra/service.py`):
   - Single TrustGraph connection per user
   - Pass the collection to all query operations
   - Maintain the same query logic with a collection parameter

**Benefits:**
- **Simplified Collection Deletion**: A simple `DELETE FROM triples WHERE collection = ?` instead of dropping tables
- **Resource Efficiency**: Fewer database connections and table objects
- **Cross-Collection Operations**: Easier to implement operations spanning multiple collections
- **Consistent Architecture**: Aligns with the unified collection metadata approach

**Migration Strategy:**
Existing table-per-collection data will need to be migrated to the new unified schema during the upgrade process.

Collection operations will be atomic where possible and provide appropriate error handling and validation.

## Security Considerations

Collection management operations require appropriate authorization to prevent unauthorized access or deletion of collections. Access control will align with existing TrustGraph security models.

## Performance Considerations

Collection listing operations may need pagination for environments with large numbers of collections. Metadata queries should be optimized for common filtering patterns.

## Testing Strategy

Comprehensive testing will cover collection lifecycle operations, metadata management, and CLI command functionality with both unit and integration tests.

## Migration Plan

This implementation requires both metadata and storage migrations:

### Collection Metadata Migration

Existing collections will need to be registered in the new Cassandra collections metadata table. A migration process will:
- Scan existing keyspaces and tables to identify collections
- Create metadata records with default values (name=collection_id, empty description/tags)
- Preserve creation timestamps where possible

### Cassandra Triple Store Migration

The Cassandra storage refactor requires data migration from table-per-collection to the unified table:
- **Pre-migration**: Identify all user keyspaces and collection tables
- **Data Transfer**: Copy triples from individual collection tables to the unified "triples" table with their collection identifier
- **Schema Validation**: Ensure the new primary key structure maintains query performance
- **Cleanup**: Remove old collection tables after successful migration
- **Rollback Plan**: Maintain the ability to restore the table-per-collection structure if needed

Migration will be performed during a maintenance window to ensure data consistency.

## Implementation Status

### ✅ Completed Components

1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_service.py`)
   - Complete collection CRUD operations (list, update, delete)
   - Cassandra collection metadata table integration via `LibraryTableStore`
   - Async request/response handling with proper error management
   - Collection deletion cascade coordination across all storage types

2. **Collection Metadata Schema** (`trustgraph-base/trustgraph/schema/services/collection.py`)
   - `CollectionManagementRequest` and `CollectionManagementResponse` schemas
   - `CollectionMetadata` schema for collection records
   - Collection request/response queue topic definitions

3. **Storage Management Schema** (`trustgraph-base/trustgraph/schema/services/storage.py`)
   - `StorageManagementRequest` and `StorageManagementResponse` schemas
   - Message format for storage-level collection operations

### ❌ Missing Components

1. **Storage Management Queue Topics**
   - Missing topic definitions in the schema for:
     - `vector_storage_management_topic`
     - `object_storage_management_topic`
     - `triples_storage_management_topic`
     - `storage_management_response_topic`
   - These are referenced by the librarian service but not yet defined

2. **Store Collection Management Handlers**
   - **Vector Store Writers** (Qdrant, Milvus, Pinecone): No collection deletion handlers
   - **Object Store Writers** (Cassandra): No collection deletion handlers
   - **Triple Store Writers** (Cassandra, Neo4j, Memgraph, FalkorDB): No collection deletion handlers
   - Need to implement `StorageManagementRequest` processing in each store writer

3. **Collection Management Interface Implementation**
   - Store writers need collection management message consumers
   - Collection deletion operations need to be implemented per store type
   - Response handling back to the librarian service

### Next Implementation Steps

1. **Define Storage Management Topics** in `trustgraph-base/trustgraph/schema/services/storage.py`
2. **Implement Collection Management Handlers** in each storage writer:
   - Add `StorageManagementRequest` consumers
   - Implement collection deletion operations
   - Add response producers for status reporting
3. **Test End-to-End Collection Deletion** across all storage types

## Timeline

- Phase 1 (Storage Topics): 1-2 days
- Phase 2 (Store Handlers): 1-2 weeks, depending on the number of storage backends
- Phase 3 (Testing & Integration): 3-5 days

## Open Questions

- Should collection deletion be a soft or hard delete by default?
- Which metadata fields should be required vs. optional?
- Should we implement storage management handlers incrementally by store type?

156
docs/tech-specs/flow-class-definition.md
Normal file
# Flow Class Definition Specification

## Overview

A flow class defines a complete dataflow pattern template in the TrustGraph system. When instantiated, it creates an interconnected network of processors that handle data ingestion, processing, storage, and querying as a unified system.

## Structure

A flow class definition consists of four main sections:

### 1. Class Section
Defines shared service processors that are instantiated once per flow class. These processors handle requests from all flow instances of this class.

```json
"class": {
    "service-name:{class}": {
        "request": "queue-pattern:{class}",
        "response": "queue-pattern:{class}"
    }
}
```

**Characteristics:**
- Shared across all flow instances of the same class
- Typically expensive or stateless services (LLMs, embedding models)
- Use the `{class}` template variable for queue naming
- Examples: `embeddings:{class}`, `text-completion:{class}`, `graph-rag:{class}`

### 2. Flow Section
Defines flow-specific processors that are instantiated for each individual flow instance. Each flow gets its own isolated set of these processors.

```json
"flow": {
    "processor-name:{id}": {
        "input": "queue-pattern:{id}",
        "output": "queue-pattern:{id}"
    }
}
```

**Characteristics:**
- Unique instance per flow
- Handle flow-specific data and state
- Use the `{id}` template variable for queue naming
- Examples: `chunker:{id}`, `pdf-decoder:{id}`, `kg-extract-relationships:{id}`

### 3. Interfaces Section
Defines the entry points and interaction contracts for the flow. These form the API surface for external systems and internal component communication.

Interfaces can take two forms:

**Fire-and-Forget Pattern** (single queue):
```json
"interfaces": {
    "document-load": "persistent://tg/flow/document-load:{id}",
    "triples-store": "persistent://tg/flow/triples-store:{id}"
}
```

**Request/Response Pattern** (object with request/response fields):
```json
"interfaces": {
    "embeddings": {
        "request": "non-persistent://tg/request/embeddings:{class}",
        "response": "non-persistent://tg/response/embeddings:{class}"
    }
}
```

**Types of Interfaces:**
- **Entry Points**: Where external systems inject data (`document-load`, `agent`)
- **Service Interfaces**: Request/response patterns for services (`embeddings`, `text-completion`)
- **Data Interfaces**: Fire-and-forget data flow connection points (`triples-store`, `entity-contexts-load`)

### 4. Metadata
Additional information about the flow class:

```json
"description": "Human-readable description",
"tags": ["capability-1", "capability-2"]
```

## Template Variables

### {id}
- Replaced with the unique flow instance identifier
- Creates isolated resources for each flow
- Example: `flow-123`, `customer-A-flow`

### {class}
- Replaced with the flow class name
- Creates shared resources across flows of the same class
- Example: `standard-rag`, `enterprise-rag`

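Template expansion itself is plain string substitution; a minimal sketch (the function name is illustrative):

```python
def expand_queue_pattern(pattern: str, flow_id: str, flow_class: str) -> str:
    """Expand the {id} and {class} template variables in a queue pattern."""
    return pattern.replace("{id}", flow_id).replace("{class}", flow_class)
```

Applying it to `persistent://tg/flow/chunk-load:{id}` with flow id `customer-A-flow` yields `persistent://tg/flow/chunk-load:customer-A-flow`, matching the instantiation example later in this document.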
## Queue Patterns (Pulsar)

Flow classes use Apache Pulsar for messaging. Queue names follow the Pulsar format:
```
<persistence>://<tenant>/<namespace>/<topic>
```

### Components:
- **persistence**: `persistent` or `non-persistent` (Pulsar persistence mode)
- **tenant**: `tg` for TrustGraph-supplied flow class definitions
- **namespace**: Indicates the messaging pattern
  - `flow`: Fire-and-forget services
  - `request`: Request portion of request/response services
  - `response`: Response portion of request/response services
- **topic**: The specific queue/topic name with template variables

### Persistent Queues
- Pattern: `persistent://tg/flow/<topic>:{id}`
- Used for fire-and-forget services and durable data flow
- Data persists in Pulsar storage across restarts
- Example: `persistent://tg/flow/chunk-load:{id}`

### Non-Persistent Queues
- Pattern: `non-persistent://tg/request/<topic>:{class}` or `non-persistent://tg/response/<topic>:{class}`
- Used for request/response messaging patterns
- Ephemeral, not persisted to disk by Pulsar
- Lower latency, suitable for RPC-style communication
- Example: `non-persistent://tg/request/embeddings:{class}`

## Dataflow Architecture

The flow class creates a unified dataflow where:

1. **Document Processing Pipeline**: Flows from ingestion through transformation to storage
2. **Query Services**: Integrated processors that query the same data stores and services
3. **Shared Services**: Centralized processors that all flows can utilize
4. **Storage Writers**: Persist processed data to appropriate stores

All processors (both `{id}` and `{class}`) work together as a cohesive dataflow graph, not as separate systems.

## Example Flow Instantiation

Given:
- Flow Instance ID: `customer-A-flow`
- Flow Class: `standard-rag`

Template expansions:
- `persistent://tg/flow/chunk-load:{id}` → `persistent://tg/flow/chunk-load:customer-A-flow`
- `non-persistent://tg/request/embeddings:{class}` → `non-persistent://tg/request/embeddings:standard-rag`

This creates:
- An isolated document processing pipeline for `customer-A-flow`
- A shared embedding service for all `standard-rag` flows
- A complete dataflow from document ingestion through querying

## Benefits

1. **Resource Efficiency**: Expensive services are shared across flows
2. **Flow Isolation**: Each flow has its own data processing pipeline
3. **Scalability**: Multiple flows can be instantiated from the same template
4. **Modularity**: Clear separation between shared and flow-specific components
5. **Unified Architecture**: Query and processing are part of the same dataflow

383
docs/tech-specs/graphql-query.md
Normal file
# GraphQL Query Technical Specification

## Overview

This specification describes the implementation of a GraphQL query interface for TrustGraph's structured data storage in Apache Cassandra. Building on the structured data capabilities outlined in the structured-data.md specification, this document details how GraphQL queries will be executed against Cassandra tables containing extracted and ingested structured objects.

The GraphQL query service will provide a flexible, type-safe interface for querying structured data stored in Cassandra. It will dynamically adapt to schema changes, support complex queries including relationships between objects, and integrate seamlessly with TrustGraph's existing message-based architecture.

## Goals

- **Dynamic Schema Support**: Automatically adapt to schema changes in configuration without service restarts
- **GraphQL Standards Compliance**: Provide a standard GraphQL interface compatible with existing GraphQL tooling and clients
- **Efficient Cassandra Queries**: Translate GraphQL queries into efficient Cassandra CQL queries respecting partition keys and indexes
- **Relationship Resolution**: Support GraphQL field resolvers for relationships between different object types
- **Type Safety**: Ensure type-safe query execution and response generation based on schema definitions
- **Scalable Performance**: Handle concurrent queries efficiently with proper connection pooling and query optimization
- **Request/Response Integration**: Maintain compatibility with TrustGraph's Pulsar-based request/response pattern
- **Error Handling**: Provide comprehensive error reporting for schema mismatches, query errors, and data validation issues

## Background

The structured data storage implementation (trustgraph-flow/trustgraph/storage/objects/cassandra/) writes objects to Cassandra tables based on schema definitions stored in TrustGraph's configuration system. These tables use a composite partition key structure with collection and schema-defined primary keys, enabling efficient queries within collections.

Current limitations that this specification addresses:

- No query interface for the structured data stored in Cassandra
- Inability to leverage GraphQL's powerful query capabilities for structured data
- Missing support for relationship traversal between related objects
- Lack of a standardized query language for structured data access

The GraphQL query service will bridge these gaps by:

- Providing a standard GraphQL interface for querying Cassandra tables
- Dynamically generating GraphQL schemas from TrustGraph configuration
- Efficiently translating GraphQL queries to Cassandra CQL
- Supporting relationship resolution through field resolvers

## Technical Design

### Architecture

The GraphQL query service will be implemented as a new TrustGraph flow processor following established patterns:

**Module Location**: `trustgraph-flow/trustgraph/query/objects/cassandra/`

**Key Components**:

1. **GraphQL Query Service Processor**
   - Extends the base FlowProcessor class
   - Implements the request/response pattern similar to existing query services
   - Monitors configuration for schema updates
   - Keeps the GraphQL schema synchronized with configuration

2. **Dynamic Schema Generator**
   - Converts TrustGraph RowSchema definitions to GraphQL types
   - Creates GraphQL object types with proper field definitions
   - Generates a root Query type with collection-based resolvers
   - Updates the GraphQL schema when configuration changes

3. **Query Executor**
   - Parses incoming GraphQL queries using the Strawberry library
   - Validates queries against the current schema
   - Executes queries and returns structured responses
   - Handles errors gracefully with detailed error messages

4. **Cassandra Query Translator**
   - Converts GraphQL selections to CQL queries
   - Optimizes queries based on available indexes and partition keys
   - Handles filtering, pagination, and sorting
   - Manages connection pooling and session lifecycle

5. **Relationship Resolver**
   - Implements field resolvers for object relationships
   - Performs efficient batch loading to avoid N+1 queries
   - Caches resolved relationships within the request context
   - Supports both forward and reverse relationship traversal

### Configuration Schema Monitoring

The service will register a configuration handler to receive schema updates:

```python
self.register_config_handler(self.on_schema_config)
```

When schemas change:

1. Parse the new schema definitions from configuration
2. Regenerate GraphQL types and resolvers
3. Update the executable schema
4. Clear any schema-dependent caches

### GraphQL Schema Generation

For each RowSchema in configuration, generate:

1. **GraphQL Object Type**:
   - Map field types (string → String, integer → Int, float → Float, boolean → Boolean)
   - Mark required fields as non-nullable in GraphQL
   - Add field descriptions from the schema

2. **Root Query Fields**:
   - Collection query (e.g., `customers`, `transactions`)
   - Filtering arguments based on indexed fields
   - Pagination support (limit, offset)
   - Sorting options for sortable fields

3. **Relationship Fields**:
   - Identify foreign key relationships from the schema
   - Create field resolvers for related objects
   - Support both single-object and list relationships

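The type mapping in item 1 can be sketched as a small lookup table. This is illustrative only; the actual generator would build Strawberry type objects rather than type-name strings:

```python
# Sketch of the RowSchema -> GraphQL scalar mapping described above.
SCALAR_MAP = {
    "string": "String",
    "integer": "Int",
    "float": "Float",
    "boolean": "Boolean",
}

def graphql_field_type(field_type: str, required: bool) -> str:
    """Map a RowSchema field type to a GraphQL type reference,
    marking required fields as non-nullable with a trailing '!'."""
    gql = SCALAR_MAP[field_type]
    return gql + "!" if required else gql
```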
### Query Execution Flow

1. **Request Reception**:
   - Receive an ObjectsQueryRequest from Pulsar
   - Extract the GraphQL query string and variables
   - Identify the user and collection context

2. **Query Validation**:
   - Parse the GraphQL query using Strawberry
   - Validate it against the current schema
   - Check field selections and argument types

3. **CQL Generation**:
   - Analyze the GraphQL selections
   - Build a CQL query with proper WHERE clauses
   - Include the collection in the partition key
   - Apply filters based on GraphQL arguments

4. **Query Execution**:
   - Execute the CQL query against Cassandra
   - Map results to the GraphQL response structure
   - Resolve any relationship fields
   - Format the response according to the GraphQL spec

5. **Response Delivery**:
   - Create an ObjectsQueryResponse with the results
   - Include any execution errors
   - Send the response via Pulsar with a correlation ID

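Step 3 (CQL generation) can be sketched as below. Table and column names are hypothetical, and values are returned as bind parameters rather than interpolated into the statement, matching the injection-safety goal noted later in this spec:

```python
# Illustrative CQL builder: collection is always part of the WHERE clause
# (it is in the partition key), and GraphQL-style filter arguments become
# additional equality predicates with bound parameters.
def build_cql(keyspace, table, collection, filters, limit=None):
    clauses = ["collection = %s"]
    params = [collection]
    for column, value in sorted(filters.items()):
        clauses.append(f"{column} = %s")
        params.append(value)
    cql = (f"SELECT * FROM {keyspace}.{table} "
           f"WHERE {' AND '.join(clauses)}")
    if limit is not None:
        cql += f" LIMIT {int(limit)}"
    return cql, params
```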
### Data Models

> **Note**: An existing StructuredQueryRequest/Response schema exists in `trustgraph-base/trustgraph/schema/services/structured_query.py`. However, it lacks critical fields (user, collection) and uses suboptimal types. The schemas below represent the recommended evolution, which should either replace the existing schemas or be created as new ObjectsQueryRequest/Response types.

#### Request Schema (ObjectsQueryRequest)

```python
from pulsar.schema import Record, String, Map, Array

class ObjectsQueryRequest(Record):
    user = String()            # Cassandra keyspace (follows pattern from TriplesQueryRequest)
    collection = String()      # Data collection identifier (required for partition key)
    query = String()           # GraphQL query string
    variables = Map(String())  # GraphQL variables (consider enhancing to support all JSON types)
    operation_name = String()  # Operation to execute for multi-operation documents
```

**Rationale for changes from the existing StructuredQueryRequest:**

- Added `user` and `collection` fields to match the pattern of other query services
- These fields are essential for identifying the Cassandra keyspace and collection
- Variables remain as Map(String()) for now but should ideally support all JSON types

#### Response Schema (ObjectsQueryResponse)

```python
from pulsar.schema import Record, String, Map, Array

from ..core.primitives import Error

class GraphQLError(Record):
    message = String()
    path = Array(String())      # Path to the field that caused the error
    extensions = Map(String())  # Additional error metadata

class ObjectsQueryResponse(Record):
    error = Error()                 # System-level error (connection, timeout, etc.)
    data = String()                 # JSON-encoded GraphQL response data
    errors = Array(GraphQLError())  # GraphQL field-level errors
    extensions = Map(String())      # Query metadata (execution time, etc.)
```

**Rationale for changes from the existing StructuredQueryResponse:**

- Distinguishes between system errors (`error`) and GraphQL errors (`errors`)
- Uses structured GraphQLError objects instead of a string array
- Adds an `extensions` field for GraphQL spec compliance
- Keeps data as a JSON string for compatibility, though native types would be preferable

### Cassandra Query Optimization

The service will optimize Cassandra queries by:

1. **Respecting Partition Keys**:
   - Always include the collection in queries
   - Use schema-defined primary keys efficiently
   - Avoid full table scans

2. **Leveraging Indexes**:
   - Use secondary indexes for filtering
   - Combine multiple filters when possible
   - Warn when queries may be inefficient

3. **Batch Loading**:
   - Collect relationship queries
   - Execute them in batches to reduce round trips
   - Cache results within the request context

4. **Connection Management**:
   - Maintain persistent Cassandra sessions
   - Use connection pooling
   - Handle reconnection on failures

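The batch-loading idea in item 3 can be sketched as a per-request loader that collects keys and resolves them with a single backend call instead of N. `fetch_many` here is a stand-in for a real batched Cassandra lookup (e.g. an IN-query); the class name is an assumption:

```python
import asyncio

class BatchLoader:
    """Collect relationship keys and resolve them with one fetch."""

    def __init__(self, fetch_many):
        self.fetch_many = fetch_many
        self.cache = {}   # per-request cache of resolved keys

    async def load_all(self, keys):
        missing = [k for k in set(keys) if k not in self.cache]
        if missing:
            # One backend round trip for all missing keys
            self.cache.update(await self.fetch_many(missing))
        return [self.cache[k] for k in keys]

async def demo():
    calls = []
    async def fetch_many(keys):
        calls.append(sorted(keys))
        return {k: f"customer-{k}" for k in keys}
    loader = BatchLoader(fetch_many)
    rows = await loader.load_all([1, 2, 1])   # three lookups, one fetch
    return rows, len(calls)
```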
### Example GraphQL Queries

#### Simple Collection Query

```graphql
{
  customers(status: "active") {
    customer_id
    name
    email
    registration_date
  }
}
```

#### Query with Relationships

```graphql
{
  orders(order_date_gt: "2024-01-01") {
    order_id
    total_amount
    customer {
      name
      email
    }
    items {
      product_name
      quantity
      price
    }
  }
}
```

#### Paginated Query

```graphql
{
  products(limit: 20, offset: 40) {
    product_id
    name
    price
    category
  }
}
```

### Implementation Dependencies

- **Strawberry GraphQL**: For GraphQL schema definition and query execution
- **Cassandra Driver**: For database connectivity (already used in the storage module)
- **TrustGraph Base**: For FlowProcessor and schema definitions
- **Configuration System**: For schema monitoring and updates

### Command-Line Interface

The service will provide a CLI command: `kg-query-objects-graphql-cassandra`

Arguments:

- `--cassandra-host`: Cassandra cluster contact point
- `--cassandra-username`: Authentication username
- `--cassandra-password`: Authentication password
- `--config-type`: Configuration type for schemas (default: "schema")
- Standard FlowProcessor arguments (Pulsar configuration, etc.)

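A minimal sketch of how these flags could be wired up, with environment-variable defaults so deployments can be configured either way. The environment variable names (`CASSANDRA_HOST`, etc.) are assumptions, not confirmed by this spec:

```python
import argparse
import os

def build_parser():
    # Hypothetical CLI surface mirroring the argument list above.
    p = argparse.ArgumentParser("kg-query-objects-graphql-cassandra")
    p.add_argument("--cassandra-host",
                   default=os.environ.get("CASSANDRA_HOST", "cassandra"))
    p.add_argument("--cassandra-username",
                   default=os.environ.get("CASSANDRA_USERNAME"))
    p.add_argument("--cassandra-password",
                   default=os.environ.get("CASSANDRA_PASSWORD"))
    p.add_argument("--config-type", default="schema")
    return p
```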
## API Integration

### Pulsar Topics

**Input Topic**: `objects-graphql-query-request`

- Schema: ObjectsQueryRequest
- Receives GraphQL queries from gateway services

**Output Topic**: `objects-graphql-query-response`

- Schema: ObjectsQueryResponse
- Returns query results and errors

### Gateway Integration

The gateway and reverse-gateway will need endpoints to:

1. Accept GraphQL queries from clients
2. Forward them to the query service via Pulsar
3. Return responses to clients
4. Support GraphQL introspection queries

### Agent Tool Integration

A new agent tool class will enable:

- Natural language to GraphQL query generation
- Direct GraphQL query execution
- Result interpretation and formatting
- Integration with agent decision flows

## Security Considerations

- **Query Depth Limiting**: Prevent deeply nested queries that could cause performance issues
- **Query Complexity Analysis**: Limit query complexity to prevent resource exhaustion
- **Field-Level Permissions**: Future support for field-level access control based on user roles
- **Input Sanitization**: Validate and sanitize all query inputs to prevent injection attacks
- **Rate Limiting**: Implement query rate limiting per user/collection

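Depth limiting can be sketched as a recursive walk over the parsed selection tree; in practice a GraphQL library's built-in validation rules could do this instead. The dict-based tree here (field name → sub-selections) is a simplification of a real AST:

```python
def max_depth(selections: dict, depth: int = 1) -> int:
    """Depth of a selection tree, counting the root level as 1."""
    if not selections:
        return depth
    return max(max_depth(sub, depth + 1) for sub in selections.values())

def check_depth(selections: dict, limit: int) -> bool:
    """Reject queries whose nesting exceeds the configured limit."""
    return max_depth(selections) <= limit
```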
## Performance Considerations

- **Query Planning**: Analyze queries before execution to optimize CQL generation
- **Result Caching**: Consider caching frequently accessed data at the field-resolver level
- **Connection Pooling**: Maintain efficient connection pools to Cassandra
- **Batch Operations**: Combine multiple queries when possible to reduce latency
- **Monitoring**: Track query performance metrics for optimization

## Testing Strategy

### Unit Tests

- Schema generation from RowSchema definitions
- GraphQL query parsing and validation
- CQL query generation logic
- Field resolver implementations

### Contract Tests

- Pulsar message contract compliance
- GraphQL schema validity
- Response format verification
- Error structure validation

### Integration Tests

- End-to-end query execution against a test Cassandra instance
- Schema update handling
- Relationship resolution
- Pagination and filtering
- Error scenarios

### Performance Tests

- Query throughput under load
- Response time for various query complexities
- Memory usage with large result sets
- Connection pool efficiency

## Migration Plan

No migration is required as this is a new capability. The service will:

1. Read existing schemas from configuration
2. Connect to existing Cassandra tables created by the storage module
3. Start accepting queries immediately upon deployment

## Timeline

- Week 1-2: Core service implementation and schema generation
- Week 3: Query execution and CQL translation
- Week 4: Relationship resolution and optimization
- Week 5: Testing and performance tuning
- Week 6: Gateway integration and documentation

## Open Questions

1. **Schema Evolution**: How should the service handle queries during schema transitions?
   - Option: Queue queries during schema updates
   - Option: Support multiple schema versions simultaneously

2. **Caching Strategy**: Should query results be cached?
   - Consider: Time-based expiration
   - Consider: Event-based invalidation

3. **Federation Support**: Should the service support GraphQL federation for combining with other data sources?
   - Would enable unified queries across structured and graph data

4. **Subscription Support**: Should the service support GraphQL subscriptions for real-time updates?
   - Would require WebSocket support in the gateway

5. **Custom Scalars**: Should custom scalar types be supported for domain-specific data types?
   - Examples: DateTime, UUID, JSON fields

## References

- Structured Data Technical Specification: `docs/tech-specs/structured-data.md`
- Strawberry GraphQL Documentation: https://strawberry.rocks/
- GraphQL Specification: https://spec.graphql.org/
- Apache Cassandra CQL Reference: https://cassandra.apache.org/doc/stable/cassandra/cql/
- TrustGraph Flow Processor Documentation: internal documentation

---

**New file:** `docs/tech-specs/import-export-graceful-shutdown.md` (682 lines)

# Import/Export Graceful Shutdown Technical Specification

## Problem Statement

The TrustGraph gateway currently experiences message loss during websocket closure in both import and export operations. This occurs due to race conditions in which messages in transit are discarded before reaching their destination (Pulsar queues for imports, websocket clients for exports).

### Import-Side Issues

1. The Publisher's asyncio.Queue buffer is not drained on shutdown
2. The websocket closes before queued messages are guaranteed to reach Pulsar
3. There is no acknowledgment mechanism for successful message delivery

### Export-Side Issues

1. Messages are acknowledged in Pulsar before successful delivery to clients
2. Hard-coded timeouts cause message drops when queues are full
3. There is no backpressure mechanism for handling slow consumers
4. There are multiple buffer points where data can be lost

## Architecture Overview

```
Import Flow:
Client -> Websocket -> TriplesImport -> Publisher -> Pulsar Queue

Export Flow:
Pulsar Queue -> Subscriber -> TriplesExport -> Websocket -> Client
```

## Proposed Fixes

### 1. Publisher Improvements (Import Side)

#### A. Graceful Queue Draining

**File**: `trustgraph-base/trustgraph/base/publisher.py`

```python
import asyncio
import logging
import time

from pulsar.schema import JsonSchema

logger = logging.getLogger(__name__)

class Publisher:

    def __init__(self, client, topic, schema=None, max_size=10,
                 chunking_enabled=True, drain_timeout=5.0):
        self.client = client
        self.topic = topic
        self.schema = schema
        self.q = asyncio.Queue(maxsize=max_size)
        self.chunking_enabled = chunking_enabled
        self.running = True
        self.draining = False  # New state for graceful shutdown
        self.task = None
        self.drain_timeout = drain_timeout

    async def stop(self):
        """Initiate graceful shutdown with draining"""
        self.running = False
        self.draining = True

        if self.task:
            # Wait for run() to complete draining
            await self.task

    async def run(self):
        """Enhanced run method with integrated draining logic"""
        while self.running or self.draining:
            try:
                producer = self.client.create_producer(
                    topic=self.topic,
                    schema=JsonSchema(self.schema),
                    chunking_enabled=self.chunking_enabled,
                )

                drain_end_time = None

                while self.running or self.draining:
                    try:
                        # Start the drain timeout when entering drain mode
                        if self.draining and drain_end_time is None:
                            drain_end_time = time.time() + self.drain_timeout
                            logger.info(f"Publisher entering drain mode, timeout={self.drain_timeout}s")

                        # Check the drain timeout
                        if self.draining and time.time() > drain_end_time:
                            if not self.q.empty():
                                logger.warning(f"Drain timeout reached with {self.q.qsize()} messages remaining")
                            self.draining = False
                            break

                        # Calculate the wait timeout based on mode
                        if self.draining:
                            # Shorter timeout during draining to exit quickly when empty
                            timeout = min(0.1, drain_end_time - time.time())
                        else:
                            # Normal operation timeout
                            timeout = 0.25

                        # Get a message from the queue
                        id, item = await asyncio.wait_for(
                            self.q.get(),
                            timeout=timeout
                        )

                        # Send the message (single place for sending)
                        if id:
                            producer.send(item, { "id": id })
                        else:
                            producer.send(item)

                    except asyncio.TimeoutError:
                        # If draining and the queue is empty, we're done
                        if self.draining and self.q.empty():
                            logger.info("Publisher queue drained successfully")
                            self.draining = False
                            break
                        continue

                # Flush the producer before closing
                if producer:
                    producer.flush()
                    producer.close()

            except Exception as e:
                logger.error(f"Exception in publisher: {e}", exc_info=True)

            if not self.running and not self.draining:
                return

            # If the handler drops out, sleep then retry
            await asyncio.sleep(1)

    async def send(self, id, item):
        """send() still works normally - it just adds to the queue"""
        if self.draining:
            # Optionally reject new messages during drain
            raise RuntimeError("Publisher is shutting down, not accepting new messages")
        await self.q.put((id, item))
```

**Key Design Benefits:**

- **Single Send Location**: All `producer.send()` calls happen in one place within the `run()` method
- **Clean State Machine**: Three clear states - running, draining, stopped
- **Timeout Protection**: Won't hang indefinitely during drain
- **Better Observability**: Clear logging of drain progress and state transitions
- **Optional Message Rejection**: Can reject new messages during the shutdown phase

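The running/draining/stopped state machine above can be demonstrated in isolation with a pure-asyncio sketch, replacing the Pulsar producer with a list so the pattern is testable; all names here are illustrative:

```python
import asyncio

class DrainingWorker:
    """Minimal model of the Publisher drain pattern: stop() flips the
    worker into drain mode, and run() exits once the queue is empty or
    the drain timeout elapses."""

    def __init__(self, drain_timeout=1.0):
        self.q = asyncio.Queue()
        self.sent = []            # stands in for producer.send()
        self.running = True
        self.draining = False
        self.drain_timeout = drain_timeout

    async def run(self):
        loop = asyncio.get_running_loop()
        drain_end = None
        while self.running or self.draining:
            if self.draining and drain_end is None:
                drain_end = loop.time() + self.drain_timeout
            if self.draining and loop.time() > drain_end:
                break                           # give up on leftovers
            try:
                item = await asyncio.wait_for(self.q.get(), timeout=0.05)
                self.sent.append(item)          # single send location
            except asyncio.TimeoutError:
                if self.draining and self.q.empty():
                    break                       # drained successfully

    async def stop(self, task):
        self.running = False
        self.draining = True
        await task

async def demo():
    w = DrainingWorker()
    for i in range(5):
        await w.q.put(i)
    task = asyncio.create_task(w.run())
    await w.stop(task)
    return w.sent
```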
#### B. Improved Shutdown Order

**File**: `trustgraph-flow/trustgraph/gateway/dispatch/triples_import.py`

```python
class TriplesImport:

    async def destroy(self):
        """Enhanced destroy with proper shutdown order"""
        # Step 1: Stop accepting new messages
        self.running.stop()

        # Step 2: Wait for the publisher to drain its queue
        logger.info("Draining publisher queue...")
        await self.publisher.stop()

        # Step 3: Close the websocket only after the queue is drained
        if self.ws:
            await self.ws.close()
```

### 2. Subscriber Improvements (Export Side)

#### A. Integrated Draining Pattern

**File**: `trustgraph-base/trustgraph/base/subscriber.py`

```python
class Subscriber:

    def __init__(self, client, topic, subscription, consumer_name,
                 schema=None, max_size=100, metrics=None,
                 backpressure_strategy="block", drain_timeout=5.0):
        # ... existing init ...
        self.backpressure_strategy = backpressure_strategy
        self.running = True
        self.draining = False  # New state for graceful shutdown
        self.drain_timeout = drain_timeout
        self.pending_acks = {}  # Track messages awaiting delivery

    async def stop(self):
        """Initiate graceful shutdown with draining"""
        self.running = False
        self.draining = True

        if self.task:
            # Wait for run() to complete draining
            await self.task

    async def run(self):
        """Enhanced run method with integrated draining logic"""
        while self.running or self.draining:

            if self.metrics:
                self.metrics.state("stopped")

            try:
                self.consumer = self.client.subscribe(
                    topic = self.topic,
                    subscription_name = self.subscription,
                    consumer_name = self.consumer_name,
                    schema = JsonSchema(self.schema),
                )

                if self.metrics:
                    self.metrics.state("running")

                logger.info("Subscriber running...")
                drain_end_time = None

                while self.running or self.draining:

                    # Start the drain timeout when entering drain mode
                    if self.draining and drain_end_time is None:
                        drain_end_time = time.time() + self.drain_timeout
                        logger.info(f"Subscriber entering drain mode, timeout={self.drain_timeout}s")

                        # Stop accepting new messages from Pulsar during drain
                        self.consumer.pause_message_listener()

                    # Check the drain timeout
                    if self.draining and time.time() > drain_end_time:
                        async with self.lock:
                            total_pending = sum(
                                q.qsize() for q in
                                list(self.q.values()) + list(self.full.values())
                            )
                        if total_pending > 0:
                            logger.warning(f"Drain timeout reached with {total_pending} messages in queues")
                        self.draining = False
                        break

                    # Check whether we can exit drain mode
                    if self.draining:
                        async with self.lock:
                            all_empty = all(
                                q.empty() for q in
                                list(self.q.values()) + list(self.full.values())
                            )
                        if all_empty and len(self.pending_acks) == 0:
                            logger.info("Subscriber queues drained successfully")
                            self.draining = False
                            break

                    # Process messages only if not draining
                    if not self.draining:
                        try:
                            msg = await asyncio.to_thread(
                                self.consumer.receive,
                                timeout_millis=250
                            )
                        except _pulsar.Timeout:
                            continue
                        except Exception as e:
                            logger.error(f"Exception in subscriber receive: {e}", exc_info=True)
                            raise e

                        if self.metrics:
                            self.metrics.received()

                        # Process the message
                        await self._process_message(msg)
                    else:
                        # During draining, just wait for the queues to empty
                        await asyncio.sleep(0.1)

            except Exception as e:
                logger.error(f"Subscriber exception: {e}", exc_info=True)

            finally:
                # Negative acknowledge any pending messages
                for msg in self.pending_acks.values():
                    self.consumer.negative_acknowledge(msg)
                self.pending_acks.clear()

                if self.consumer:
                    self.consumer.unsubscribe()
                    self.consumer.close()
                    self.consumer = None

            if self.metrics:
                self.metrics.state("stopped")

            if not self.running and not self.draining:
                return

            # If the handler drops out, sleep then retry
            await asyncio.sleep(1)

    async def _process_message(self, msg):
        """Process a single message with deferred acknowledgment"""
        # Store the message for later acknowledgment
        msg_id = str(uuid.uuid4())
        self.pending_acks[msg_id] = msg

        try:
            id = msg.properties()["id"]
        except KeyError:
            id = None

        value = msg.value()
        delivery_success = False

        async with self.lock:
            # Deliver to specific subscribers
            if id in self.q:
                delivery_success = await self._deliver_to_queue(
                    self.q[id], value
                )

            # Deliver to all subscribers
            for q in self.full.values():
                if await self._deliver_to_queue(q, value):
                    delivery_success = True

        # Acknowledge only on successful delivery
        if delivery_success:
            self.consumer.acknowledge(msg)
        else:
            # Negative acknowledge for retry
            self.consumer.negative_acknowledge(msg)
        del self.pending_acks[msg_id]

    async def _deliver_to_queue(self, queue, value):
        """Deliver a message to a queue with backpressure handling"""
        try:
            if self.backpressure_strategy == "block":
                # Block until space is available (no timeout)
                await queue.put(value)
                return True

            elif self.backpressure_strategy == "drop_oldest":
                # Drop the oldest message if the queue is full
                if queue.full():
                    try:
                        queue.get_nowait()
                        if self.metrics:
                            self.metrics.dropped()
                    except asyncio.QueueEmpty:
                        pass
                await queue.put(value)
                return True

            elif self.backpressure_strategy == "drop_new":
                # Drop the new message if the queue is full
                if queue.full():
                    if self.metrics:
                        self.metrics.dropped()
                    return False
                await queue.put(value)
                return True

        except Exception as e:
            logger.error(f"Failed to deliver message: {e}")
            return False
```

**Key Design Benefits (matching the Publisher pattern):**

- **Single Processing Location**: All message processing happens in the `run()` method
- **Clean State Machine**: Three clear states - running, draining, stopped
- **Pause During Drain**: Stops accepting new messages from Pulsar while draining existing queues
- **Timeout Protection**: Won't hang indefinitely during drain
- **Proper Cleanup**: Negative acknowledges any undelivered messages on shutdown

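The `drop_oldest` and `drop_new` branches of `_deliver_to_queue` can be demonstrated standalone against a bounded `asyncio.Queue` (the `block` strategy is omitted since it simply awaits `queue.put()`); this is a sketch, not the module's actual API:

```python
import asyncio

def deliver(queue: asyncio.Queue, value, strategy: str) -> bool:
    """Non-blocking model of the drop-based backpressure strategies."""
    if strategy == "drop_new":
        if queue.full():
            return False              # the new message is dropped
        queue.put_nowait(value)
        return True
    if strategy == "drop_oldest":
        if queue.full():
            queue.get_nowait()        # evict the oldest to make room
        queue.put_nowait(value)
        return True
    raise ValueError(strategy)

q = asyncio.Queue(maxsize=2)
deliver(q, 1, "drop_new")
deliver(q, 2, "drop_new")
accepted = deliver(q, 3, "drop_new")  # queue full -> dropped
deliver(q, 4, "drop_oldest")          # evicts 1, enqueues 4
```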
#### B. Export Handler Improvements

**File**: `trustgraph-flow/trustgraph/gateway/dispatch/triples_export.py`

```python
class TriplesExport:

    async def destroy(self):
        """Enhanced destroy with graceful shutdown"""
        # Step 1: Signal stop to prevent new messages
        self.running.stop()

        # Step 2: Wait briefly for in-flight messages
        await asyncio.sleep(0.5)

        # Step 3: Unsubscribe and stop the subscriber (triggers queue drain)
        if hasattr(self, 'subs'):
            await self.subs.unsubscribe_all(self.id)
            await self.subs.stop()

        # Step 4: Close the websocket last
        if self.ws and not self.ws.closed:
            await self.ws.close()

    async def run(self):
        """Enhanced run with better error handling"""
        self.subs = Subscriber(
            client = self.pulsar_client,
            topic = self.queue,
            consumer_name = self.consumer,
            subscription = self.subscriber,
            schema = Triples,
            backpressure_strategy = "block"  # Configurable
        )

        await self.subs.start()

        self.id = str(uuid.uuid4())
        q = await self.subs.subscribe_all(self.id)

        consecutive_errors = 0
        max_consecutive_errors = 5

        while self.running.get():
            try:
                resp = await asyncio.wait_for(q.get(), timeout=0.5)
                await self.ws.send_json(serialize_triples(resp))
                consecutive_errors = 0  # Reset on success

            except asyncio.TimeoutError:
                continue

            except Exception as e:
                logger.error(f"Exception sending to websocket: {str(e)}")
                consecutive_errors += 1

                if consecutive_errors >= max_consecutive_errors:
                    logger.error("Too many consecutive errors, shutting down")
                    break

                # Brief pause before retry
                await asyncio.sleep(0.1)

        # Graceful cleanup is handled in destroy()
```

### 3. Socket-Level Improvements

**File**: `trustgraph-flow/trustgraph/gateway/endpoint/socket.py`

```python
class SocketEndpoint:

    async def listener(self, ws, dispatcher, running):
        """Enhanced listener with graceful shutdown"""
        async for msg in ws:
            if msg.type == WSMsgType.TEXT:
                await dispatcher.receive(msg)
                continue
            elif msg.type == WSMsgType.BINARY:
                await dispatcher.receive(msg)
                continue
            else:
                # Graceful shutdown on close
                logger.info("Websocket closing, initiating graceful shutdown")
                running.stop()

                # Allow time for dispatcher cleanup
                await asyncio.sleep(1.0)
                break

    async def handle(self, request):
        """Enhanced handler with better cleanup"""
        # ... existing setup code ...

        dispatcher = None

        try:
            async with asyncio.TaskGroup() as tg:

                running = Running()

                dispatcher = await self.dispatcher(
                    ws, running, request.match_info
                )

                worker_task = tg.create_task(
                    self.worker(ws, dispatcher, running)
                )

                lsnr_task = tg.create_task(
                    self.listener(ws, dispatcher, running)
                )

        except ExceptionGroup as e:
            logger.error("Exception group occurred:", exc_info=True)

            # Attempt a graceful dispatcher shutdown
            if dispatcher:
                try:
                    await asyncio.wait_for(
                        dispatcher.destroy(),
                        timeout=5.0
                    )
                except asyncio.TimeoutError:
                    logger.warning("Dispatcher shutdown timed out")
                except Exception as de:
                    logger.error(f"Error during dispatcher cleanup: {de}")

        except Exception as e:
            logger.error(f"Socket exception: {e}", exc_info=True)

        finally:
            # Ensure dispatcher cleanup
            if dispatcher and hasattr(dispatcher, 'destroy'):
                try:
                    await dispatcher.destroy()
                except Exception:
                    pass

            # Ensure the websocket is closed
            if ws and not ws.closed:
                await ws.close()

        return ws
```

## Configuration Options

Add configuration support for tuning behavior:

```python
# config.py
class GracefulShutdownConfig:

    # Publisher settings
    PUBLISHER_DRAIN_TIMEOUT = 5.0    # Seconds to wait for queue drain
    PUBLISHER_FLUSH_TIMEOUT = 2.0    # Producer flush timeout

    # Subscriber settings
    SUBSCRIBER_DRAIN_TIMEOUT = 5.0   # Seconds to wait for queue drain
    BACKPRESSURE_STRATEGY = "block"  # Options: "block", "drop_oldest", "drop_new"
    SUBSCRIBER_MAX_QUEUE_SIZE = 100  # Maximum queue size before backpressure

    # Socket settings
    SHUTDOWN_GRACE_PERIOD = 1.0      # Seconds to wait for graceful shutdown
    MAX_CONSECUTIVE_ERRORS = 5       # Maximum errors before forced shutdown

    # Monitoring
    LOG_QUEUE_STATS = True           # Log queue statistics on shutdown
    METRICS_ENABLED = True           # Enable metrics collection
```

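The drain behaviour these settings bound can be sketched with plain asyncio (a minimal sketch; `SketchPublisher` and its internals are illustrative assumptions, not the actual TrustGraph `Publisher` class):

```python
import asyncio

PUBLISHER_DRAIN_TIMEOUT = 5.0  # mirrors GracefulShutdownConfig above


class SketchPublisher:
    """Minimal publisher that drains its queue before stopping."""

    def __init__(self):
        self.q = asyncio.Queue()
        self.sent = []  # stands in for messages flushed to the producer

    async def send(self, id, msg):
        await self.q.put((id, msg))

    async def _drain(self):
        # Flush every queued message to the (stubbed) producer
        while not self.q.empty():
            self.sent.append(self.q.get_nowait())

    async def stop(self):
        # Bound the drain so shutdown cannot hang forever
        try:
            await asyncio.wait_for(self._drain(), timeout=PUBLISHER_DRAIN_TIMEOUT)
        except asyncio.TimeoutError:
            pass  # anything still queued would be counted as dropped


async def main():
    p = SketchPublisher()
    for i in range(3):
        await p.send(f"id-{i}", {"data": i})
    await p.stop()
    return p.sent

sent = asyncio.run(main())
```

Messages still queued when the timeout fires are the ones the monitoring section below would count as dropped.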
## Testing Strategy

### Unit Tests

```python
async def test_publisher_queue_drain():
    """Verify Publisher drains queue on shutdown"""
    publisher = Publisher(...)

    # Fill queue with messages
    for i in range(10):
        await publisher.send(f"id-{i}", {"data": i})

    # Stop publisher
    await publisher.stop()

    # Verify all messages were sent
    assert publisher.q.empty()
    assert mock_producer.send.call_count == 10


async def test_subscriber_deferred_ack():
    """Verify Subscriber only acks on successful delivery"""
    subscriber = Subscriber(..., backpressure_strategy="drop_new")

    # Fill queue to capacity
    queue = await subscriber.subscribe("test")
    for i in range(100):
        await queue.put({"data": i})

    # Try to add message when full
    msg = create_mock_message()
    await subscriber._process_message(msg)

    # Verify negative acknowledgment
    assert msg.negative_acknowledge.called
    assert not msg.acknowledge.called
```

### Integration Tests

```python
async def test_import_graceful_shutdown():
    """Test import path handles shutdown gracefully"""
    # Setup
    import_handler = TriplesImport(...)
    await import_handler.start()

    # Send messages
    messages = []
    for i in range(100):
        msg = {"metadata": {...}, "triples": [...]}
        await import_handler.receive(msg)
        messages.append(msg)

    # Shutdown while messages in flight
    await import_handler.destroy()

    # Verify all messages reached Pulsar
    received = await pulsar_consumer.receive_all()
    assert len(received) == 100


async def test_export_no_message_loss():
    """Test export path doesn't lose acknowledged messages"""
    # Setup Pulsar with test messages
    for i in range(100):
        await pulsar_producer.send({"data": i})

    # Start export handler
    export_handler = TriplesExport(...)
    export_task = asyncio.create_task(export_handler.run())

    # Receive some messages
    received = []
    for _ in range(50):
        msg = await websocket.receive()
        received.append(msg)

    # Force shutdown
    await export_handler.destroy()

    # Continue receiving until websocket closes
    while not websocket.closed:
        try:
            msg = await websocket.receive()
            received.append(msg)
        except Exception:
            break

    # Verify no acknowledged messages were lost
    assert len(received) >= 50
```

## Rollout Plan

### Phase 1: Critical Fixes (Week 1)
- Fix Subscriber acknowledgment timing (prevent message loss)
- Add Publisher queue draining
- Deploy to staging environment

### Phase 2: Graceful Shutdown (Week 2)
- Implement shutdown coordination
- Add backpressure strategies
- Performance testing

### Phase 3: Monitoring & Tuning (Week 3)
- Add metrics for queue depths
- Add alerts for message drops
- Tune timeout values based on production data

## Monitoring & Alerts

### Metrics to Track
- `publisher.queue.depth` - Current Publisher queue size
- `publisher.messages.dropped` - Messages lost during shutdown
- `subscriber.messages.negatively_acknowledged` - Failed deliveries
- `websocket.graceful_shutdowns` - Successful graceful shutdowns
- `websocket.forced_shutdowns` - Forced/timeout shutdowns

### Alerts
- Publisher queue depth > 80% capacity
- Any message drops during shutdown
- Subscriber negative acknowledgment rate > 1%
- Shutdown timeout exceeded

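The alert conditions above can be evaluated against a metric snapshot like this (a sketch using a plain dict as a stand-in registry; the capacity constant and metric values are illustrative assumptions):

```python
# Assumed queue capacity, matching the config sketch earlier in the spec
SUBSCRIBER_MAX_QUEUE_SIZE = 100

# Hypothetical snapshot of the metric names listed above
metrics = {
    "publisher.queue.depth": 85,
    "publisher.messages.dropped": 2,
    "subscriber.messages.negatively_acknowledged": 0,
    "websocket.graceful_shutdowns": 10,
    "websocket.forced_shutdowns": 1,
}


def check_alerts(m):
    """Evaluate the queue-depth and message-drop alert conditions."""
    alerts = []
    if m["publisher.queue.depth"] > 0.8 * SUBSCRIBER_MAX_QUEUE_SIZE:
        alerts.append("publisher queue depth > 80% capacity")
    if m["publisher.messages.dropped"] > 0:
        alerts.append("messages dropped during shutdown")
    return alerts


alerts = check_alerts(metrics)
```

A real deployment would presumably back these names with the existing metrics stack (for example Prometheus gauges and counters) rather than a dict.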
## Backwards Compatibility

All changes maintain backwards compatibility:
- Default behavior unchanged without configuration
- Existing deployments continue to function
- Graceful degradation if new features unavailable

## Security Considerations

- No new attack vectors introduced
- Backpressure prevents memory exhaustion attacks
- Configurable limits prevent resource abuse

## Performance Impact

- Minimal overhead during normal operation
- Shutdown may take up to 5 seconds longer (configurable)
- Memory usage bounded by queue size limits
- CPU impact negligible (<1% increase)

359
docs/tech-specs/neo4j-user-collection-isolation.md
Normal file
# Neo4j User/Collection Isolation Support

## Problem Statement

The Neo4j triples storage and query implementation currently lacks user/collection isolation, which creates a multi-tenancy security issue. All triples are stored in the same graph space without any mechanism to prevent users from accessing other users' data or mixing collections.

Unlike other storage backends in TrustGraph:
- **Cassandra**: Uses separate keyspaces per user and tables per collection
- **Vector stores** (Milvus, Qdrant, Pinecone): Use collection-specific namespaces
- **Neo4j**: Currently shares all data in a single graph (security vulnerability)

## Current Architecture

### Data Model
- **Nodes**: `:Node` label with `uri` property, `:Literal` label with `value` property
- **Relationships**: `:Rel` label with `uri` property
- **Indexes**: `Node.uri`, `Literal.value`, `Rel.uri`

### Message Flow
- `Triples` messages contain `metadata.user` and `metadata.collection` fields
- Storage service receives user/collection info but ignores it
- Query service expects `user` and `collection` in `TriplesQueryRequest` but ignores them

### Current Security Issue
```cypher
// Any user can query any data - no isolation
MATCH (src:Node)-[rel:Rel]->(dest:Node)
RETURN src.uri, rel.uri, dest.uri
```

## Proposed Solution: Property-Based Filtering (Recommended)

### Overview
Add `user` and `collection` properties to all nodes and relationships, then filter all operations by these properties. This approach provides strong isolation while maintaining query flexibility and backwards compatibility.

### Data Model Changes

#### Enhanced Node Structure
```cypher
// Node entities
CREATE (n:Node {
  uri: "http://example.com/entity1",
  user: "john_doe",
  collection: "production_v1"
})

// Literal entities
CREATE (n:Literal {
  value: "literal value",
  user: "john_doe",
  collection: "production_v1"
})
```

#### Enhanced Relationship Structure
```cypher
// Relationships with user/collection properties
CREATE (src)-[:Rel {
  uri: "http://example.com/predicate1",
  user: "john_doe",
  collection: "production_v1"
}]->(dest)
```

#### Updated Indexes
```cypher
// Compound indexes for efficient filtering
CREATE INDEX node_user_collection_uri FOR (n:Node) ON (n.user, n.collection, n.uri);
CREATE INDEX literal_user_collection_value FOR (n:Literal) ON (n.user, n.collection, n.value);
CREATE INDEX rel_user_collection_uri FOR ()-[r:Rel]-() ON (r.user, r.collection, r.uri);

// Maintain existing indexes for backwards compatibility (optional)
CREATE INDEX Node_uri FOR (n:Node) ON (n.uri);
CREATE INDEX Literal_value FOR (n:Literal) ON (n.value);
CREATE INDEX Rel_uri FOR ()-[r:Rel]-() ON (r.uri);
```

### Implementation Changes
|
||||
|
||||
#### Storage Service (`write.py`)
|
||||
|
||||
**Current Code:**
|
||||
```python
|
||||
def create_node(self, uri):
|
||||
summary = self.io.execute_query(
|
||||
"MERGE (n:Node {uri: $uri})",
|
||||
uri=uri, database_=self.db,
|
||||
).summary
|
||||
```
|
||||
|
||||
**Updated Code:**
|
||||
```python
|
||||
def create_node(self, uri, user, collection):
|
||||
summary = self.io.execute_query(
|
||||
"MERGE (n:Node {uri: $uri, user: $user, collection: $collection})",
|
||||
uri=uri, user=user, collection=collection, database_=self.db,
|
||||
).summary
|
||||
```
|
||||
|
||||
**Enhanced store_triples Method:**
|
||||
```python
|
||||
async def store_triples(self, message):
|
||||
user = message.metadata.user
|
||||
collection = message.metadata.collection
|
||||
|
||||
for t in message.triples:
|
||||
self.create_node(t.s.value, user, collection)
|
||||
|
||||
if t.o.is_uri:
|
||||
self.create_node(t.o.value, user, collection)
|
||||
self.relate_node(t.s.value, t.p.value, t.o.value, user, collection)
|
||||
else:
|
||||
self.create_literal(t.o.value, user, collection)
|
||||
self.relate_literal(t.s.value, t.p.value, t.o.value, user, collection)
|
||||
```
|
||||
|
||||
#### Query Service (`service.py`)

**Current Code:**
```python
records, summary, keys = self.io.execute_query(
    "MATCH (src:Node {uri: $src})-[rel:Rel {uri: $rel}]->(dest:Node) "
    "RETURN dest.uri as dest",
    src=query.s.value, rel=query.p.value, database_=self.db,
)
```

**Updated Code:**
```python
records, summary, keys = self.io.execute_query(
    "MATCH (src:Node {uri: $src, user: $user, collection: $collection})-"
    "[rel:Rel {uri: $rel, user: $user, collection: $collection}]->"
    "(dest:Node {user: $user, collection: $collection}) "
    "RETURN dest.uri as dest",
    src=query.s.value, rel=query.p.value,
    user=query.user, collection=query.collection,
    database_=self.db,
)
```

### Migration Strategy

#### Phase 1: Add Properties to New Data
1. Update storage service to add user/collection properties to new triples
2. Maintain backwards compatibility by not requiring properties in queries
3. Existing data remains accessible but not isolated

#### Phase 2: Migrate Existing Data
```cypher
// Migrate existing nodes (requires default user/collection assignment)
MATCH (n:Node) WHERE n.user IS NULL
SET n.user = 'legacy_user', n.collection = 'default_collection';

MATCH (n:Literal) WHERE n.user IS NULL
SET n.user = 'legacy_user', n.collection = 'default_collection';

MATCH ()-[r:Rel]->() WHERE r.user IS NULL
SET r.user = 'legacy_user', r.collection = 'default_collection';
```

#### Phase 3: Enforce Isolation
1. Update query service to require user/collection filtering
2. Add validation to reject queries without proper user/collection context
3. Remove legacy data access paths

### Security Considerations

#### Query Validation
```python
async def query_triples(self, query):
    # Validate user/collection parameters
    if not query.user or not query.collection:
        raise ValueError("User and collection must be specified")

    # All queries must include user/collection filters
    # ... rest of implementation
```

#### Preventing Parameter Injection
- Use parameterized queries exclusively
- Validate user/collection values against allowed patterns
- Consider sanitization for Neo4j property name requirements

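Pattern-based validation of the scope parameters could look like the following (a sketch; the allow-list pattern and function name are assumptions and should match however TrustGraph actually constrains user and collection identifiers):

```python
import re

# Hypothetical allow-list: alphanumerics, underscore, hyphen, bounded length
IDENT_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def validate_scope(user, collection):
    """Reject missing or suspicious user/collection values before querying."""
    for name, value in (("user", user), ("collection", collection)):
        if not value:
            raise ValueError(f"{name} must be specified")
        if not IDENT_RE.match(value):
            raise ValueError(f"invalid {name}: {value!r}")
    return True


validate_scope("john_doe", "production_v1")
```

Because the Cypher itself is parameterized, this check is defence in depth: it keeps obviously malformed identifiers out of logs, audit trails, and any code path that might interpolate them.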
#### Audit Trail
```python
logger.info(f"Query executed - User: {query.user}, Collection: {query.collection}, "
            f"Pattern: {query.s}/{query.p}/{query.o}")
```

## Alternative Approaches Considered

### Option 2: Label-Based Isolation

**Approach**: Use dynamic labels like `User_john_Collection_prod`

**Pros:**
- Strong isolation through label filtering
- Efficient query performance with label indexes
- Clear data separation

**Cons:**
- Neo4j has practical limits on number of labels (~1000s)
- Complex label name generation and sanitization
- Difficult to query across collections when needed

**Implementation Example:**
```cypher
CREATE (n:Node:User_john_Collection_prod {uri: "http://example.com/entity"})
MATCH (n:User_john_Collection_prod) WHERE n:Node RETURN n
```

### Option 3: Database-Per-User

**Approach**: Create separate Neo4j databases for each user or user/collection combination

**Pros:**
- Complete data isolation
- No risk of cross-contamination
- Independent scaling per user

**Cons:**
- Resource overhead (each database consumes memory)
- Complex database lifecycle management
- Neo4j Community Edition database limits
- Difficult cross-user analytics

### Option 4: Composite Key Strategy

**Approach**: Prefix all URIs and values with user/collection information

**Pros:**
- Backwards compatible with existing queries
- Simple implementation
- No schema changes required

**Cons:**
- URI pollution affects data semantics
- Less efficient queries (string prefix matching)
- Breaks RDF/semantic web standards

**Implementation Example:**
```python
def make_composite_uri(uri, user, collection):
    return f"usr:{user}:col:{collection}:uri:{uri}"
```

## Implementation Plan

### Phase 1: Foundation (Week 1)
1. [ ] Update storage service to accept and store user/collection properties
2. [ ] Add compound indexes for efficient querying
3. [ ] Implement backwards compatibility layer
4. [ ] Create unit tests for new functionality

### Phase 2: Query Updates (Week 2)
1. [ ] Update all query patterns to include user/collection filters
2. [ ] Add query validation and security checks
3. [ ] Update integration tests
4. [ ] Performance testing with filtered queries

### Phase 3: Migration & Deployment (Week 3)
1. [ ] Create data migration scripts for existing Neo4j instances
2. [ ] Deployment documentation and runbooks
3. [ ] Monitoring and alerting for isolation violations
4. [ ] End-to-end testing with multiple users/collections

### Phase 4: Hardening (Week 4)
1. [ ] Remove legacy compatibility mode
2. [ ] Add comprehensive audit logging
3. [ ] Security review and penetration testing
4. [ ] Performance optimization

## Testing Strategy

### Unit Tests
```python
def test_user_collection_isolation():
    # Store triples for user1/collection1
    processor.store_triples(triples_user1_coll1)

    # Store triples for user2/collection2
    processor.store_triples(triples_user2_coll2)

    # Query as user1 should only return user1's data
    results = processor.query_triples(query_user1_coll1)
    assert all_results_belong_to_user1_coll1(results)

    # Query as user2 should only return user2's data
    results = processor.query_triples(query_user2_coll2)
    assert all_results_belong_to_user2_coll2(results)
```

### Integration Tests
- Multi-user scenarios with overlapping data
- Cross-collection queries (should fail)
- Migration testing with existing data
- Performance benchmarks with large datasets

### Security Tests
- Attempt to query other users' data
- SQL injection style attacks on user/collection parameters
- Verify complete isolation under various query patterns

## Performance Considerations

### Index Strategy
- Compound indexes on `(user, collection, uri)` for optimal filtering
- Consider partial indexes if some collections are much larger
- Monitor index usage and query performance

### Query Optimization
- Use EXPLAIN to verify index usage in filtered queries
- Consider query result caching for frequently accessed data
- Profile memory usage with large numbers of users/collections

### Scalability
- Each user/collection combination creates separate data islands
- Monitor database size and connection pool usage
- Consider horizontal scaling strategies if needed

## Security & Compliance

### Data Isolation Guarantees
- **Physical**: All user data stored with explicit user/collection properties
- **Logical**: All queries filtered by user/collection context
- **Access Control**: Service-level validation prevents unauthorized access

### Audit Requirements
- Log all data access with user/collection context
- Track migration activities and data movements
- Monitor for isolation violation attempts

### Compliance Considerations
- GDPR: Enhanced ability to locate and delete user-specific data
- SOC2: Clear data isolation and access controls
- HIPAA: Strong tenant isolation for healthcare data

## Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Query missing user/collection filter | High | Medium | Mandatory validation, comprehensive testing |
| Performance degradation | Medium | Low | Index optimization, query profiling |
| Migration data corruption | High | Low | Backup strategy, rollback procedures |
| Complex multi-collection queries | Medium | Medium | Document query patterns, provide examples |

## Success Criteria

1. **Security**: Zero cross-user data access in production
2. **Performance**: <10% query performance impact vs unfiltered queries
3. **Migration**: 100% of existing data successfully migrated with zero loss
4. **Usability**: All existing query patterns work with user/collection context
5. **Compliance**: Full audit trail of user/collection data access

## Conclusion

The property-based filtering approach provides the best balance of security, performance, and maintainability for adding user/collection isolation to Neo4j. It aligns with TrustGraph's existing multi-tenancy patterns while leveraging Neo4j's strengths in graph querying and indexing.

This solution ensures TrustGraph's Neo4j backend meets the same security standards as other storage backends, preventing data isolation vulnerabilities while maintaining the flexibility and power of graph queries.

559
docs/tech-specs/structured-data-descriptor.md
Normal file
# Structured Data Descriptor Specification

## Overview

The Structured Data Descriptor is a JSON-based configuration language that describes how to parse, transform, and import structured data into TrustGraph. It provides a declarative approach to data ingestion, supporting multiple input formats and complex transformation pipelines without requiring custom code.

## Core Concepts

### 1. Format Definition
Describes the input file type and parsing options. Determines which parser to use and how to interpret the source data.

### 2. Field Mappings
Maps source paths to target fields with transformations. Defines how data flows from input sources to output schema fields.

### 3. Transform Pipeline
Chain of data transformations that can be applied to field values, including:
- Data cleaning (trim, normalize)
- Format conversion (date parsing, type casting)
- Calculations (arithmetic, string manipulation)
- Lookups (reference tables, substitutions)

### 4. Validation Rules
Data quality checks applied to ensure data integrity:
- Type validation
- Range checks
- Pattern matching (regex)
- Required field validation
- Custom validation logic

### 5. Global Settings
Configuration that applies across the entire import process:
- Lookup tables for data enrichment
- Global variables and constants
- Output format specifications
- Error handling policies

## Implementation Strategy

The importer implementation follows this pipeline:

1. **Parse Configuration** - Load and validate the JSON descriptor
2. **Initialize Parser** - Load the appropriate parser (CSV, XML, JSON, etc.) based on `format.type`
3. **Apply Preprocessing** - Execute global filters and transformations
4. **Process Records** - For each input record:
   - Extract data using source paths (JSONPath, XPath, column names)
   - Apply field-level transforms in sequence
   - Validate results against defined rules
   - Apply default values for missing data
5. **Apply Postprocessing** - Execute deduplication, aggregation, etc.
6. **Generate Output** - Produce data in the specified target format

## Path Expression Support

Different input formats use appropriate path expression languages:

- **CSV**: Column names or indices (`"column_name"` or `"[2]"`)
- **JSON**: JSONPath syntax (`"$.user.profile.email"`)
- **XML**: XPath expressions (`"//product[@id='123']/price"`)
- **Fixed-width**: Field names from field definitions

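The pipeline steps above can be sketched end-to-end for the CSV case (a minimal sketch using only the standard library; `run_import` and the tiny transform/validation coverage are illustrative assumptions, not the actual importer API):

```python
import csv
import io


def run_import(descriptor, text):
    """Tiny CSV-only importer following the pipeline steps above."""
    rows = list(csv.DictReader(io.StringIO(text)))   # step 2: parse input
    out, errors = [], []
    for row in rows:                                 # step 4: per record
        record = {}
        for m in descriptor["mappings"]:
            value = row.get(m["source"])             # extract via source path
            for t in m.get("transforms", []):        # apply transforms in sequence
                if t["type"] == "trim":
                    value = value.strip()
                elif t["type"] == "to_int":
                    value = int(value)
            for v in m.get("validation", []):        # validate results
                if v["type"] == "range" and not (v["min"] <= value <= v["max"]):
                    errors.append((m["target_field"], value))
            record[m["target_field"]] = value
        out.append(record)
    return out, errors                               # step 6: generate output


descriptor = {
    "mappings": [
        {"target_field": "name", "source": "customer_name",
         "transforms": [{"type": "trim"}]},
        {"target_field": "age", "source": "age",
         "transforms": [{"type": "to_int"}],
         "validation": [{"type": "range", "min": 0, "max": 120}]},
    ]
}
records, errors = run_import(descriptor, "customer_name,age\n  Ada Lovelace ,36\n")
```

A full implementation would add the remaining steps (descriptor validation, preprocessing, postprocessing) and dispatch on `format.type` to select the parser.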
## Benefits

- **Single Codebase** - One importer handles multiple input formats
- **User-Friendly** - Non-technical users can create configurations
- **Reusable** - Configurations can be shared and versioned
- **Flexible** - Complex transformations without custom coding
- **Robust** - Built-in validation and comprehensive error handling
- **Maintainable** - Declarative approach reduces implementation complexity

## Language Specification

The Structured Data Descriptor uses a JSON configuration format with the following top-level structure:

```json
{
  "version": "1.0",
  "metadata": {
    "name": "Configuration Name",
    "description": "Description of what this config does",
    "author": "Author Name",
    "created": "2024-01-01T00:00:00Z"
  },
  "format": { ... },
  "globals": { ... },
  "preprocessing": [ ... ],
  "mappings": [ ... ],
  "postprocessing": [ ... ],
  "output": { ... }
}
```

### Format Definition

Describes the input data format and parsing options:

```json
{
  "format": {
    "type": "csv|json|xml|fixed-width|excel|parquet",
    "encoding": "utf-8",
    "options": {
      // Format-specific options
    }
  }
}
```

#### CSV Format Options
```json
{
  "format": {
    "type": "csv",
    "options": {
      "delimiter": ",",
      "quote_char": "\"",
      "escape_char": "\\",
      "skip_rows": 1,
      "has_header": true,
      "null_values": ["", "NULL", "null", "N/A"]
    }
  }
}
```

#### JSON Format Options
```json
{
  "format": {
    "type": "json",
    "options": {
      "root_path": "$.data",
      "array_mode": "records|single",
      "flatten": false
    }
  }
}
```

#### XML Format Options
```json
{
  "format": {
    "type": "xml",
    "options": {
      "root_element": "//records/record",
      "namespaces": {
        "ns": "http://example.com/namespace"
      }
    }
  }
}
```

### Global Settings

Define lookup tables, variables, and global configuration:

```json
{
  "globals": {
    "variables": {
      "current_date": "2024-01-01",
      "batch_id": "BATCH_001",
      "default_confidence": 0.8
    },
    "lookup_tables": {
      "country_codes": {
        "US": "United States",
        "UK": "United Kingdom",
        "CA": "Canada"
      },
      "status_mapping": {
        "1": "active",
        "0": "inactive"
      }
    },
    "constants": {
      "source_system": "legacy_crm",
      "import_type": "full"
    }
  }
}
```

### Field Mappings

Define how source data maps to target fields with transformations:

```json
{
  "mappings": [
    {
      "target_field": "person_name",
      "source": "$.name",
      "transforms": [
        {"type": "trim"},
        {"type": "title_case"},
        {"type": "required"}
      ],
      "validation": [
        {"type": "min_length", "value": 2},
        {"type": "max_length", "value": 100},
        {"type": "pattern", "value": "^[A-Za-z\\s]+$"}
      ]
    },
    {
      "target_field": "age",
      "source": "$.age",
      "transforms": [
        {"type": "to_int"},
        {"type": "default", "value": 0}
      ],
      "validation": [
        {"type": "range", "min": 0, "max": 150}
      ]
    },
    {
      "target_field": "country",
      "source": "$.country_code",
      "transforms": [
        {"type": "lookup", "table": "country_codes"},
        {"type": "default", "value": "Unknown"}
      ]
    }
  ]
}
```

### Transform Types

Available transformation functions:

#### String Transforms
```json
{"type": "trim"},
{"type": "upper"},
{"type": "lower"},
{"type": "title_case"},
{"type": "replace", "pattern": "old", "replacement": "new"},
{"type": "regex_replace", "pattern": "\\d+", "replacement": "XXX"},
{"type": "substring", "start": 0, "end": 10},
{"type": "pad_left", "length": 10, "char": "0"}
```

#### Type Conversions
```json
{"type": "to_string"},
{"type": "to_int"},
{"type": "to_float"},
{"type": "to_bool"},
{"type": "to_date", "format": "YYYY-MM-DD"},
{"type": "parse_json"}
```

#### Data Operations
```json
{"type": "default", "value": "default_value"},
{"type": "lookup", "table": "table_name"},
{"type": "concat", "values": ["field1", " - ", "field2"]},
{"type": "calculate", "expression": "${field1} + ${field2}"},
{"type": "conditional", "condition": "${age} > 18", "true_value": "adult", "false_value": "minor"}
```

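One way an implementation might execute such a chain is to dispatch on `type` and fold the value through each step (a sketch covering a handful of the transforms above; the function name and registry layout are assumptions):

```python
def apply_transforms(value, transforms, lookup_tables=None):
    """Fold a field value through a list of transform specs."""
    tables = lookup_tables or {}
    for t in transforms:
        kind = t["type"]
        if kind == "trim":
            value = value.strip()
        elif kind == "lower":
            value = value.lower()
        elif kind == "title_case":
            value = value.title()
        elif kind == "to_int":
            value = int(value)
        elif kind == "lookup":
            # Fall back to the original value when the key is missing
            value = tables.get(t["table"], {}).get(value, value)
        elif kind == "default":
            value = t["value"] if value in (None, "") else value
    return value


tables = {"country_codes": {"US": "United States"}}
country = apply_transforms(
    " US ",
    [{"type": "trim"}, {"type": "lookup", "table": "country_codes"}],
    tables,
)
```

Because each transform takes the previous step's output, order matters: trimming before the lookup is what makes the table key match.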
### Validation Rules

Data quality checks with configurable error handling:

#### Basic Validations
```json
{"type": "required"},
{"type": "not_null"},
{"type": "min_length", "value": 5},
{"type": "max_length", "value": 100},
{"type": "range", "min": 0, "max": 1000},
{"type": "pattern", "value": "^[A-Z]{2,3}$"},
{"type": "in_list", "values": ["active", "inactive", "pending"]}
```

#### Custom Validations
```json
{
  "type": "custom",
  "expression": "${age} >= 18 && ${country} == 'US'",
  "message": "Must be 18+ and in US"
},
{
  "type": "cross_field",
  "fields": ["start_date", "end_date"],
  "expression": "${start_date} < ${end_date}",
  "message": "Start date must be before end date"
}
```

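A matching validator can walk the rule list and collect failures rather than raising on the first one, which supports the `log`/`skip` error policies below (a sketch of the basic rules only; `custom` and `cross_field` rules would additionally need an expression evaluator):

```python
import re


def validate_field(value, rules):
    """Return a list of human-readable failures for one field value."""
    failures = []
    for r in rules:
        kind = r["type"]
        if kind == "required" and value in (None, ""):
            failures.append("value is required")
        elif kind == "min_length" and len(value) < r["value"]:
            failures.append(f"shorter than {r['value']}")
        elif kind == "max_length" and len(value) > r["value"]:
            failures.append(f"longer than {r['value']}")
        elif kind == "range" and not (r["min"] <= value <= r["max"]):
            failures.append(f"outside [{r['min']}, {r['max']}]")
        elif kind == "pattern" and not re.match(r["value"], value):
            failures.append("does not match pattern")
        elif kind == "in_list" and value not in r["values"]:
            failures.append("not in allowed list")
    return failures


errs = validate_field("AB", [{"type": "pattern", "value": "^[A-Z]{2,3}$"},
                             {"type": "min_length", "value": 2}])
```

Collecting all failures per field lets the importer report every problem in a record at once instead of stopping at the first broken rule.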
### Preprocessing and Postprocessing

Global operations applied before/after field mapping:

```json
{
  "preprocessing": [
    {
      "type": "filter",
      "condition": "${status} != 'deleted'"
    },
    {
      "type": "sort",
      "field": "created_date",
      "order": "asc"
    }
  ],
  "postprocessing": [
    {
      "type": "deduplicate",
      "key_fields": ["email", "phone"]
    },
    {
      "type": "aggregate",
      "group_by": ["country"],
      "functions": {
        "total_count": {"type": "count"},
        "avg_age": {"type": "avg", "field": "age"}
      }
    }
  ]
}
```

### Output Configuration

Define how processed data should be output:

```json
{
  "output": {
    "format": "trustgraph-objects",
    "schema_name": "person",
    "options": {
      "batch_size": 1000,
      "confidence": 0.9,
      "source_span_field": "raw_text",
      "metadata": {
        "source": "crm_import",
        "version": "1.0"
      }
    },
    "error_handling": {
      "on_validation_error": "skip|fail|log",
      "on_transform_error": "skip|fail|default",
      "max_errors": 100,
      "error_output": "errors.json"
    }
  }
}
```

## Complete Example
|
||||
|
||||
```json
|
||||
{
|
||||
"version": "1.0",
|
||||
"metadata": {
|
||||
"name": "Customer Import from CRM CSV",
|
||||
"description": "Imports customer data from legacy CRM system",
|
||||
"author": "Data Team",
|
||||
"created": "2024-01-01T00:00:00Z"
|
||||
},
|
||||
"format": {
|
||||
"type": "csv",
|
||||
"encoding": "utf-8",
|
||||
"options": {
|
||||
"delimiter": ",",
|
||||
"has_header": true,
|
||||
"skip_rows": 1
|
||||
}
|
||||
},
|
||||
"globals": {
|
||||
"variables": {
|
||||
"import_date": "2024-01-01",
|
||||
"default_confidence": 0.85
|
||||
},
|
||||
"lookup_tables": {
|
||||
"country_codes": {
|
||||
"US": "United States",
|
||||
"CA": "Canada",
|
||||
"UK": "United Kingdom"
|
||||
}
|
||||
}
|
||||
},
|
||||
"preprocessing": [
|
||||
{
|
||||
"type": "filter",
|
||||
"condition": "${status} == 'active'"
|
||||
}
|
||||
],
|
||||
"mappings": [
|
||||
{
|
||||
"target_field": "full_name",
|
||||
"source": "customer_name",
|
||||
"transforms": [
|
||||
{"type": "trim"},
|
||||
{"type": "title_case"}
|
||||
],
|
||||
"validation": [
|
||||
{"type": "required"},
|
||||
{"type": "min_length", "value": 2}
|
||||
]
|
||||
},
|
||||
{
|
||||
"target_field": "email",
|
||||
"source": "email_address",
|
||||
"transforms": [
|
||||
{"type": "trim"},
|
||||
{"type": "lower"}
|
||||
],
|
||||
"validation": [
|
||||
{"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
|
||||
]
|
||||
},
|
||||
{
|
||||
"target_field": "age",
|
||||
"source": "age",
|
||||
"transforms": [
|
||||
{"type": "to_int"},
|
||||
{"type": "default", "value": 0}
|
||||
],
|
||||
"validation": [
|
||||
{"type": "range", "min": 0, "max": 120}
|
||||
]
|
||||
},
|
||||
{
|
||||
"target_field": "country",
|
||||
"source": "country_code",
|
||||
"transforms": [
|
||||
{"type": "lookup", "table": "country_codes"},
|
||||
{"type": "default", "value": "Unknown"}
|
||||
]
|
||||
}
|
||||
],
|
||||
"output": {
|
||||
"format": "trustgraph-objects",
|
||||
"schema_name": "customer",
|
||||
"options": {
|
||||
"confidence": "${default_confidence}",
|
||||
"batch_size": 500
|
||||
},
|
||||
"error_handling": {
|
||||
"on_validation_error": "log",
|
||||
"max_errors": 50
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
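To make the mapping semantics concrete, here is a minimal sketch of how one `mappings` entry could be interpreted. The transform and validation names mirror the descriptor format above, but `apply_mapping` and the `TRANSFORMS` table are illustrative assumptions, not TrustGraph's actual loader code.

```python
import re

# Hypothetical mini-interpreter for a single descriptor mapping entry.
TRANSFORMS = {
    "trim": lambda v, t: v.strip(),
    "lower": lambda v, t: v.lower(),
    "title_case": lambda v, t: v.title(),
    "to_int": lambda v, t: int(v),
    "default": lambda v, t: v if v not in (None, "") else t["value"],
}

def apply_mapping(record, mapping):
    """Apply one 'mappings' entry to a source record.

    Returns (target_field, value), or raises ValueError when a
    validation rule fails."""
    value = record.get(mapping["source"])
    for t in mapping.get("transforms", []):
        value = TRANSFORMS[t["type"]](value, t)
    for v in mapping.get("validation", []):
        if v["type"] == "required" and not value:
            raise ValueError("required field missing")
        if v["type"] == "pattern" and not re.match(v["value"], value):
            raise ValueError("pattern mismatch")
    return mapping["target_field"], value

mapping = {
    "target_field": "email",
    "source": "email_address",
    "transforms": [{"type": "trim"}, {"type": "lower"}],
    "validation": [
        {"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
    ],
}
print(apply_mapping({"email_address": "  JANE.DOE@GMAIL.COM "}, mapping))
# → ('email', 'jane.doe@gmail.com')
```

Applied to the complete example above, the same loop would run per record, with `error_handling.on_validation_error` deciding whether a ValueError skips, fails, or logs.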

## LLM Prompt for Descriptor Generation

The following prompt can be used to have an LLM analyze sample data and generate a descriptor configuration:

```
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.

The descriptor should follow this specification:
- version: "1.0"
- metadata: Configuration name, description, author, and creation date
- format: Input format type and parsing options
- globals: Variables, lookup tables, and constants
- preprocessing: Filters and transformations applied before mapping
- mappings: Field-by-field mapping from source to target with transformations and validations
- postprocessing: Operations like deduplication or aggregation
- output: Target format and error handling configuration

ANALYZE THE DATA:
1. Identify the format (CSV, JSON, XML, etc.)
2. Detect delimiters, encodings, and structure
3. Find data types for each field
4. Identify patterns and constraints
5. Look for fields that need cleaning or transformation
6. Find relationships between fields
7. Identify lookup opportunities (codes that map to values)
8. Detect required vs optional fields

CREATE THE DESCRIPTOR:
For each field in the sample data:
- Map it to an appropriate target field name
- Add necessary transformations (trim, case conversion, type casting)
- Include appropriate validations (required, patterns, ranges)
- Set defaults for missing values

Include preprocessing if needed:
- Filters to exclude invalid records
- Sorting requirements

Include postprocessing if beneficial:
- Deduplication on key fields
- Aggregation for summary data

Configure output for TrustGraph:
- format: "trustgraph-objects"
- schema_name: Based on the data entity type
- Appropriate error handling

DATA SAMPLE:
[Insert data sample here]

ADDITIONAL CONTEXT (optional):
- Target schema name: [if known]
- Business rules: [any specific requirements]
- Data quality issues to address: [known problems]

Generate a complete, valid Structured Data Descriptor configuration that will properly import this data into TrustGraph. Include comments explaining key decisions.
```

### Example Usage Prompt

```
I need you to analyze the provided data sample and create a Structured Data Descriptor configuration in JSON format.

[Standard instructions from above...]

DATA SAMPLE (CSV):
CustomerID,Name,Email,Age,Country,Status,JoinDate,TotalPurchases
1001,"Smith, John",john.smith@email.com,35,US,1,2023-01-15,5420.50
1002,"doe, jane",JANE.DOE@GMAIL.COM,28,CA,1,2023-03-22,3200.00
1003,"Bob Johnson",bob@,62,UK,0,2022-11-01,0
1004,"Alice Chen","alice.chen@company.org",41,US,1,2023-06-10,8900.25
1005,,invalid-email,25,XX,1,2024-01-01,100

ADDITIONAL CONTEXT:
- Target schema name: customer
- Business rules: Email should be valid and lowercase, names should be title case
- Data quality issues: Some emails are invalid, some names are missing, country codes need mapping
```

### Prompt for Analyzing Existing Data Without Sample

```
I need you to help me create a Structured Data Descriptor configuration for importing [data type] data.

The source data has these characteristics:
- Format: [CSV/JSON/XML/etc]
- Fields: [list the fields]
- Data quality issues: [describe any known issues]
- Volume: [approximate number of records]

Requirements:
- [List any specific transformation needs]
- [List any validation requirements]
- [List any business rules]

Please generate a Structured Data Descriptor configuration that will:
1. Parse the input format correctly
2. Clean and standardize the data
3. Validate according to the requirements
4. Handle errors gracefully
5. Output in TrustGraph ExtractedObject format

Focus on making the configuration robust and reusable.
```
@@ -114,7 +114,7 @@ The structured data integration requires the following technical components:

    Module: trustgraph-flow/trustgraph/storage/objects/cassandra

-5. **Structured Query Service**
+5. **Structured Query Service** ✅ **[COMPLETE]**
    - Accepts structured queries in defined formats
    - Executes queries against the structured store
    - Returns objects matching query criteria

273 docs/tech-specs/structured-diag-service.md Normal file

@@ -0,0 +1,273 @@
# Structured Data Diagnostic Service Technical Specification

## Overview

This specification describes a new invokable service for diagnosing and analyzing structured data within TrustGraph. The service extracts functionality from the existing `tg-load-structured-data` command-line tool and exposes it as a request/response service, enabling programmatic access to data type detection and descriptor generation capabilities.

The service supports three primary operations:

1. **Data Type Detection**: Analyze a data sample to determine its format (CSV, JSON, or XML)
2. **Descriptor Generation**: Generate a TrustGraph structured data descriptor for a given data sample and type
3. **Combined Diagnosis**: Perform both type detection and descriptor generation in sequence

## Goals

- **Modularize Data Analysis**: Extract data diagnosis logic from the CLI into reusable service components
- **Enable Programmatic Access**: Provide API-based access to data analysis capabilities
- **Support Multiple Data Formats**: Handle CSV, JSON, and XML data formats consistently
- **Generate Accurate Descriptors**: Produce structured data descriptors that accurately map source data to TrustGraph schemas
- **Maintain Backward Compatibility**: Ensure existing CLI functionality continues to work
- **Enable Service Composition**: Allow other services to leverage data diagnosis capabilities
- **Improve Testability**: Separate business logic from the CLI interface for better testing
- **Support Streaming Analysis**: Enable analysis of data samples without loading entire files

## Background

Currently, the `tg-load-structured-data` command provides comprehensive functionality for analyzing structured data and generating descriptors. However, this functionality is tightly coupled to the CLI interface, limiting its reusability.

Current limitations include:
- Data diagnosis logic embedded in CLI code
- No programmatic access to type detection and descriptor generation
- Difficult to integrate diagnosis capabilities into other services
- Limited ability to compose data analysis workflows

This specification addresses these gaps by creating a dedicated service for structured data diagnosis. By exposing these capabilities as a service, TrustGraph can:
- Enable other services to analyze data programmatically
- Support more complex data processing pipelines
- Facilitate integration with external systems
- Improve maintainability through separation of concerns

## Technical Design

### Architecture

The structured data diagnostic service requires the following technical components:

1. **Diagnostic Service Processor**
   - Handles incoming diagnosis requests
   - Orchestrates type detection and descriptor generation
   - Returns structured responses with diagnosis results

   Module: `trustgraph-flow/trustgraph/diagnosis/structured_data/service.py`

2. **Data Type Detector**
   - Uses algorithmic detection to identify data format (CSV, JSON, XML)
   - Analyzes data structure, delimiters, and syntax patterns
   - Returns detected format and confidence scores

   Module: `trustgraph-flow/trustgraph/diagnosis/structured_data/type_detector.py`
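As a rough illustration of what "algorithmic detection" could look like, the sketch below tries strict JSON and XML parses first, then falls back to CSV sniffing. The heuristics and confidence values are assumptions for this spec, not the actual `type_detector.py` implementation.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def detect_type(sample: str) -> tuple[str, float]:
    """Return (detected_type, confidence) for a text sample."""
    text = sample.strip()
    # JSON: accept only if the whole sample parses cleanly.
    try:
        json.loads(text)
        return "json", 0.95
    except ValueError:
        pass
    # XML: accept only a single well-formed element tree.
    try:
        ET.fromstring(text)
        return "xml", 0.95
    except ET.ParseError:
        pass
    # CSV: look for a consistent delimiter and column count.
    try:
        dialect = csv.Sniffer().sniff(text)
        rows = list(csv.reader(io.StringIO(text), dialect))
        widths = {len(r) for r in rows if r}
        if len(widths) == 1 and widths.pop() > 1:
            return "csv", 0.9   # uniform columns: high confidence
        return "csv", 0.6       # ragged rows: lower confidence
    except csv.Error:
        return "unknown", 0.0

print(detect_type("a,b\n1,2\n3,4"))
# → ('csv', 0.9)
```

Ordering matters here: a CSV sniffer will happily "detect" many JSON or XML samples, so the stricter parsers run first.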

3. **Descriptor Generator**
   - Uses the prompt service to generate descriptors
   - Invokes format-specific prompts (diagnose-csv, diagnose-json, diagnose-xml)
   - Maps data fields to TrustGraph schema fields through prompt responses

   Module: `trustgraph-flow/trustgraph/diagnosis/structured_data/descriptor_generator.py`

### Data Models

#### StructuredDataDiagnosisRequest

Request message for structured data diagnosis operations:

```python
class StructuredDataDiagnosisRequest:
    operation: str              # "detect-type", "generate-descriptor", or "diagnose"
    sample: str                 # Data sample to analyze (text content)
    type: Optional[str]         # Data type (csv, json, xml) - required for generate-descriptor
    schema_name: Optional[str]  # Target schema name for descriptor generation
    options: Dict[str, Any]     # Additional options (e.g., delimiter for CSV)
```

#### StructuredDataDiagnosisResponse

Response message containing diagnosis results:

```python
class StructuredDataDiagnosisResponse:
    operation: str                # The operation that was performed
    detected_type: Optional[str]  # Detected data type (for detect-type/diagnose)
    confidence: Optional[float]   # Confidence score for type detection
    descriptor: Optional[Dict]    # Generated descriptor (for generate-descriptor/diagnose)
    error: Optional[str]          # Error message if operation failed
    metadata: Dict[str, Any]      # Additional metadata (e.g., field count, sample records)
```

#### Descriptor Structure

The generated descriptor follows the existing structured data descriptor format:

```json
{
  "format": {
    "type": "csv",
    "encoding": "utf-8",
    "options": {
      "delimiter": ",",
      "has_header": true
    }
  },
  "mappings": [
    {
      "source_field": "customer_id",
      "target_field": "id",
      "transforms": [
        {"type": "trim"}
      ]
    }
  ],
  "output": {
    "schema_name": "customer",
    "options": {
      "batch_size": 1000,
      "confidence": 0.9
    }
  }
}
```

### Service Interface

The service will expose the following operations through the request/response pattern:

1. **Type Detection Operation**
   - Input: Data sample
   - Processing: Analyze data structure using algorithmic detection
   - Output: Detected type with confidence score

2. **Descriptor Generation Operation**
   - Input: Data sample, type, target schema name
   - Processing:
     - Call the prompt service with a format-specific prompt ID (diagnose-csv, diagnose-json, or diagnose-xml)
     - Pass the data sample and available schemas to the prompt
     - Receive the generated descriptor from the prompt response
   - Output: Structured data descriptor

3. **Combined Diagnosis Operation**
   - Input: Data sample, optional schema name
   - Processing:
     - Use algorithmic detection to identify the format first
     - Select the appropriate format-specific prompt based on the detected type
     - Call the prompt service to generate the descriptor
   - Output: Both detected type and descriptor

### Implementation Details

The service will follow TrustGraph service conventions:

1. **Service Registration**
   - Register as the `structured-diag` service type
   - Use standard request/response topics
   - Implement the FlowProcessor base class
   - Register PromptClientSpec for prompt service interaction

2. **Configuration Management**
   - Access schema configurations via the config service
   - Cache schemas for performance
   - Handle configuration updates dynamically

3. **Prompt Integration**
   - Use existing prompt service infrastructure
   - Call the prompt service with format-specific prompt IDs:
     - `diagnose-csv`: For CSV data analysis
     - `diagnose-json`: For JSON data analysis
     - `diagnose-xml`: For XML data analysis
   - Prompts are configured in the prompt config, not hard-coded in the service
   - Pass schemas and data samples as prompt variables
   - Parse prompt responses to extract descriptors

4. **Error Handling**
   - Validate input data samples
   - Provide descriptive error messages
   - Handle malformed data gracefully
   - Handle prompt service failures

5. **Data Sampling**
   - Process configurable sample sizes
   - Handle incomplete records appropriately
   - Maintain sampling consistency

### API Integration

The service will integrate with existing TrustGraph APIs:

Modified Components:
- `tg-load-structured-data` CLI - Refactored to use the new service for diagnosis operations
- Flow API - Extended to support structured data diagnosis requests

New Service Endpoints:
- `/api/v1/flow/{flow}/diagnose/structured-data` - WebSocket endpoint for diagnosis requests
- `/api/v1/diagnose/structured-data` - REST endpoint for synchronous diagnosis

### Message Flow

```
Client → Gateway → Structured Diag Service → Config Service (for schemas)
                              ↓
                   Type Detector (algorithmic)
                              ↓
                   Prompt Service (diagnose-csv/json/xml)
                              ↓
                   Descriptor Generator (parses prompt response)
                              ↓
Client ← Gateway ← Structured Diag Service (response)
```

## Security Considerations

- Input validation to prevent injection attacks
- Size limits on data samples to prevent DoS
- Sanitization of generated descriptors
- Access control through existing TrustGraph authentication

## Performance Considerations

- Cache schema definitions to reduce config service calls
- Limit sample sizes to maintain responsive performance
- Use streaming processing for large data samples
- Implement timeout mechanisms for long-running analyses

## Testing Strategy

1. **Unit Tests**
   - Type detection for various data formats
   - Descriptor generation accuracy
   - Error handling scenarios

2. **Integration Tests**
   - Service request/response flow
   - Schema retrieval and caching
   - CLI integration

3. **Performance Tests**
   - Large sample processing
   - Concurrent request handling
   - Memory usage under load

## Migration Plan

1. **Phase 1**: Implement the service with core functionality
2. **Phase 2**: Refactor the CLI to use the service (maintain backward compatibility)
3. **Phase 3**: Add REST API endpoints
4. **Phase 4**: Deprecate embedded CLI logic (with a notice period)

## Timeline

- Week 1-2: Implement core service and type detection
- Week 3-4: Add descriptor generation and integration
- Week 5: Testing and documentation
- Week 6: CLI refactoring and migration

## Open Questions

- Should the service support additional data formats (e.g., Parquet, Avro)?
- What should be the maximum sample size for analysis?
- Should diagnosis results be cached for repeated requests?
- How should the service handle multi-schema scenarios?
- Should the prompt IDs be configurable parameters for the service?

## References

- [Structured Data Descriptor Specification](structured-data-descriptor.md)
- [Structured Data Loading Documentation](structured-data.md)
- `tg-load-structured-data` implementation: `trustgraph-cli/trustgraph/cli/load_structured_data.py`
491 docs/tech-specs/tool-group.md Normal file

@@ -0,0 +1,491 @@
# TrustGraph Tool Group System
## Technical Specification v1.0

### Executive Summary

This specification defines a tool grouping system for TrustGraph agents that allows fine-grained control over which tools are available for specific requests. The system introduces group-based tool filtering through configuration and request-level specification, enabling better security boundaries, resource management, and functional partitioning of agent capabilities.

### 1. Overview

#### 1.1 Problem Statement

Currently, TrustGraph agents have access to all configured tools regardless of request context or security requirements. This creates several challenges:

- **Security Risk**: Sensitive tools (e.g., data modification) are available even for read-only queries
- **Resource Waste**: Complex tools are loaded even when simple queries don't require them
- **Functional Confusion**: Agents may select inappropriate tools when simpler alternatives exist
- **Multi-tenant Isolation**: Different user groups need access to different tool sets

#### 1.2 Solution Overview

The tool group system introduces:

1. **Group Classification**: Tools are tagged with group memberships during configuration
2. **Request-level Filtering**: AgentRequest specifies which tool groups are permitted
3. **Runtime Enforcement**: Agents only have access to tools matching the requested groups
4. **Flexible Grouping**: Tools can belong to multiple groups for complex scenarios

### 2. Schema Changes

#### 2.1 Tool Configuration Schema Enhancement

The existing tool configuration is enhanced with a `group` field:

**Before:**
```json
{
  "name": "knowledge-query",
  "type": "knowledge-query",
  "description": "Query the knowledge graph"
}
```

**After:**
```json
{
  "name": "knowledge-query",
  "type": "knowledge-query",
  "description": "Query the knowledge graph",
  "group": ["read-only", "knowledge", "basic"]
}
```

**Group Field Specification:**
- `group`: Array(String) - List of groups this tool belongs to
- **Optional**: Tools without a group field belong to the "default" group
- **Multi-membership**: Tools can belong to multiple groups
- **Case-sensitive**: Group names are exact string matches

#### 2.1.2 Tool State Transition Enhancement

Tools can optionally specify state transitions and state-based availability:

```json
{
  "name": "knowledge-query",
  "type": "knowledge-query",
  "description": "Query the knowledge graph",
  "group": ["read-only", "knowledge", "basic"],
  "state": "analysis",
  "available_in_states": ["undefined", "research"]
}
```

**State Field Specification:**
- `state`: String - **Optional** - State to transition to after successful tool execution
- `available_in_states`: Array(String) - **Optional** - States in which this tool is available
- **Default behavior**: Tools without `available_in_states` are available in all states
- **State transition**: Only occurs after successful tool execution

#### 2.2 AgentRequest Schema Enhancement

The `AgentRequest` schema in `trustgraph-base/trustgraph/schema/services/agent.py` is enhanced:

**Current AgentRequest:**
- `question`: String - User query
- `plan`: String - Execution plan (can be removed)
- `state`: String - Agent state
- `history`: Array(AgentStep) - Execution history

**Enhanced AgentRequest:**
- `question`: String - User query
- `state`: String - Agent execution state (now actively used for tool filtering)
- `history`: Array(AgentStep) - Execution history
- `group`: Array(String) - **NEW** - Tool groups allowed for this request

**Schema Changes:**
- **Removed**: The `plan` field is no longer needed and can be removed (it was originally intended for tool specification)
- **Added**: `group` field for tool group specification
- **Enhanced**: The `state` field now controls tool availability during execution

**Field Behaviors:**

**Group Field:**
- **Optional**: If not specified, defaults to ["default"]
- **Intersection**: Only tools matching at least one specified group are available
- **Empty array**: No tools available (the agent can only use internal reasoning)
- **Wildcard**: The special group "*" grants access to all tools

**State Field:**
- **Optional**: If not specified, defaults to "undefined"
- **State-based filtering**: Only tools available in the current state are eligible
- **Default state**: The "undefined" state allows all tools (subject to group filtering)
- **State transitions**: Tools can change the state after successful execution

### 3. Custom Group Examples

Organizations can define domain-specific groups:

```json
{
  "financial-tools": ["stock-query", "portfolio-analysis"],
  "medical-tools": ["diagnosis-assist", "drug-interaction"],
  "legal-tools": ["contract-analysis", "case-search"]
}
```

### 4. Implementation Details

#### 4.1 Tool Loading and Filtering

**Configuration Phase:**
1. All tools are loaded from configuration with their group assignments
2. Tools without explicit groups are assigned to the "default" group
3. Group membership is validated and stored in the tool registry

**Request Processing Phase:**
1. An AgentRequest arrives with an optional group specification
2. The agent filters available tools based on group intersection
3. Only matching tools are passed to the agent execution context
4. The agent operates with the filtered tool set throughout the request lifecycle

#### 4.2 Tool Filtering Logic

**Combined Group and State Filtering:**

```
For each configured tool:
  tool_groups = tool.group || ["default"]
  tool_states = tool.available_in_states || ["*"]  // Available in all states

For each request:
  requested_groups = request.group || ["default"]
  current_state = request.state || "undefined"

Tool is available if:
  // Group filtering
  (intersection(tool_groups, requested_groups) is not empty OR "*" in requested_groups)
  AND
  // State filtering
  (current_state in tool_states OR "*" in tool_states)
```

**State Transition Logic:**

```
After successful tool execution:
  if tool.state is defined:
    next_request.state = tool.state
  else:
    next_request.state = current_request.state  // No change
```
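The two pseudocode rules above can be rendered directly as Python. This is a sketch of the intended semantics; the function names and dict-based tool representation are assumptions for illustration, not the agent's actual code.

```python
def is_tool_available(tool: dict, requested_groups: list, state: str) -> bool:
    """Combined group and state filtering, per the rules above."""
    tool_groups = tool.get("group") or ["default"]
    tool_states = tool.get("available_in_states") or ["*"]
    group_ok = "*" in requested_groups or bool(set(tool_groups) & set(requested_groups))
    state_ok = "*" in tool_states or state in tool_states
    return group_ok and state_ok

def next_state(tool: dict, current_state: str) -> str:
    """State transition: only applies after successful execution."""
    return tool.get("state") or current_state

# Two tools from the configuration examples below.
tools = {
    "knowledge-query": {
        "group": ["read-only", "knowledge", "basic"],
        "available_in_states": ["undefined", "research"],
        "state": "analysis",
    },
    "graph-update": {
        "group": ["write", "knowledge", "admin"],
        "available_in_states": ["analysis", "modification"],
    },
}
available = [name for name, t in tools.items()
             if is_tool_available(t, ["read-only", "knowledge"], "undefined")]
print(available)
# → ['knowledge-query']
```

Note that `graph-update` is excluded by state, not by group: it shares the "knowledge" group with the request, but is not available in the "undefined" state.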

#### 4.3 Agent Integration Points

**ReAct Agent:**
- Tool filtering occurs in agent_manager.py during tool registry creation
- The available tools list is filtered by both group and state before plan generation
- State transitions update the AgentRequest.state field after successful tool execution
- The next iteration uses the updated state for tool filtering

**Confidence-Based Agent:**
- Tool filtering occurs in planner.py during plan generation
- ExecutionStep validation ensures only group- and state-eligible tools are used
- The flow controller enforces tool availability at runtime
- State transitions are managed by the Flow Controller between steps

### 5. Configuration Examples

#### 5.1 Tool Configuration with Groups and States

```yaml
tool:
  knowledge-query:
    type: knowledge-query
    name: "Knowledge Graph Query"
    description: "Query the knowledge graph for entities and relationships"
    group: ["read-only", "knowledge", "basic"]
    state: "analysis"
    available_in_states: ["undefined", "research"]

  graph-update:
    type: graph-update
    name: "Graph Update"
    description: "Add or modify entities in the knowledge graph"
    group: ["write", "knowledge", "admin"]
    available_in_states: ["analysis", "modification"]

  text-completion:
    type: text-completion
    name: "Text Completion"
    description: "Generate text using language models"
    group: ["read-only", "text", "basic"]
    state: "undefined"
    # No available_in_states = available in all states

  complex-analysis:
    type: mcp-tool
    name: "Complex Analysis Tool"
    description: "Perform complex data analysis"
    group: ["advanced", "compute", "expensive"]
    state: "results"
    available_in_states: ["analysis"]
    mcp_tool_id: "analysis-server"

  reset-workflow:
    type: mcp-tool
    name: "Reset Workflow"
    description: "Reset to initial state"
    group: ["admin"]
    state: "undefined"
    available_in_states: ["analysis", "results"]
```

#### 5.2 Request Examples with State Workflows

**Initial Research Request:**
```json
{
  "question": "What entities are connected to Company X?",
  "group": ["read-only", "knowledge"],
  "state": "undefined"
}
```
*Available tools: knowledge-query, text-completion*
*After knowledge-query: state → "analysis"*

**Analysis Phase:**
```json
{
  "question": "Continue analysis based on previous results",
  "group": ["advanced", "compute", "write"],
  "state": "analysis"
}
```
*Available tools: complex-analysis, graph-update, reset-workflow*
*After complex-analysis: state → "results"*

**Results Phase:**
```json
{
  "question": "What should I do with these results?",
  "group": ["admin"],
  "state": "results"
}
```
*Available tools: reset-workflow only*
*After reset-workflow: state → "undefined"*

**Workflow Example - Complete Flow:**
1. **Start (undefined)**: Use knowledge-query → transitions to "analysis"
2. **Analysis state**: Use complex-analysis → transitions to "results"
3. **Results state**: Use reset-workflow → transitions back to "undefined"
4. **Back to start**: All initial tools available again

### 6. Security Considerations

#### 6.1 Access Control Integration

**Gateway-Level Filtering:**
- The gateway can enforce group restrictions based on user permissions
- Prevents elevation of privileges through request manipulation
- The audit trail includes requested and granted tool groups

**Example Gateway Logic:**
```
user_permissions = get_user_permissions(request.user_id)
allowed_groups = user_permissions.tool_groups
requested_groups = request.group

# Validate request doesn't exceed permissions
if not is_subset(requested_groups, allowed_groups):
    reject_request("Insufficient permissions for requested tool groups")
```
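The `is_subset` check above is essentially a set comparison, with one subtlety: a request for the wildcard group must not succeed unless the user's allowance explicitly includes it. A minimal sketch, with illustrative names (this is not the gateway's actual API):

```python
def check_groups(requested: list[str], allowed: list[str]) -> bool:
    """True if every requested tool group is within the user's allowance."""
    if "*" in requested and "*" not in allowed:
        # A wildcard request needs an explicit wildcard grant,
        # otherwise it would be a privilege escalation.
        return False
    return set(requested) <= set(allowed) or "*" in allowed

print(check_groups(["read-only", "knowledge"], ["read-only", "knowledge", "basic"]))
# → True
```

A rejected request should be logged with both the requested and allowed groups, feeding the `agent_group_access_denied_total` metric defined in section 8.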
|
||||
#### 6.2 Audit and Monitoring
|
||||
|
||||
**Enhanced Audit Trail:**
|
||||
- Log requested tool groups and initial state per request
|
||||
- Track state transitions and tool usage by group membership
|
||||
- Monitor unauthorized group access attempts and invalid state transitions
|
||||
- Alert on unusual group usage patterns or suspicious state workflows
|
||||
|
||||
### 7. Migration Strategy
|
||||
|
||||
#### 7.1 Backward Compatibility
|
||||
|
||||
**Phase 1: Additive Changes**
|
||||
- Add optional `group` field to tool configurations
|
||||
- Add optional `group` field to AgentRequest schema
|
||||
- Default behavior: All existing tools belong to "default" group
|
||||
- Existing requests without group field use "default" group
|
||||
|
||||
**Existing Behavior Preserved:**
|
||||
- Tools without group configuration continue to work (default group)
|
||||
- Tools without state configuration are available in all states
|
||||
- Requests without group specification access all tools (default group)
|
||||
- Requests without state specification use "undefined" state (all tools available)
|
||||
- No breaking changes to existing deployments
|
||||
|
||||
### 8. Monitoring and Observability
|
||||
|
||||
#### 8.1 New Metrics
|
||||
|
||||
**Tool Group Usage:**
|
||||
- `agent_tool_group_requests_total` - Counter of requests by group
|
||||
- `agent_tool_group_availability` - Gauge of tools available per group
|
||||
- `agent_filtered_tools_count` - Histogram of tool count after group+state filtering
|
||||
|
||||
**State Workflow Metrics:**
|
||||
- `agent_state_transitions_total` - Counter of state transitions by tool
|
||||
- `agent_workflow_duration_seconds` - Histogram of time spent in each state
|
||||
- `agent_state_availability` - Gauge of tools available per state
|
||||
|
||||
**Security Metrics:**
|
||||
- `agent_group_access_denied_total` - Counter of unauthorized group access
|
||||
- `agent_invalid_state_transition_total` - Counter of invalid state transitions
|
||||
- `agent_privilege_escalation_attempts_total` - Counter of suspicious requests
|
||||

#### 8.2 Logging Enhancements

**Request Logging:**

```json
{
  "request_id": "req-123",
  "requested_groups": ["read-only", "knowledge"],
  "initial_state": "undefined",
  "state_transitions": [
    {"tool": "knowledge-query", "from": "undefined", "to": "analysis", "timestamp": "2024-01-01T10:00:01Z"}
  ],
  "available_tools": ["knowledge-query", "text-completion"],
  "filtered_by_group": ["graph-update", "admin-tool"],
  "filtered_by_state": [],
  "execution_time": "1.2s"
}
```

### 9. Testing Strategy

#### 9.1 Unit Tests

**Tool Filtering Logic:**

- Test group intersection calculations
- Test state-based filtering logic
- Verify default group and state assignment
- Test wildcard group behavior
- Validate empty group handling
- Test combined group+state filtering scenarios

**Configuration Validation:**

- Test tool loading with various group and state configurations
- Verify schema validation for invalid group and state specifications
- Test backward compatibility with existing configurations
- Validate state transition definitions and cycles
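
A minimal example of the group-intersection and state-filtering cases; `filter_tools` is a stand-in for the real filtering implementation, and the tool dictionaries are hypothetical:

```python
# Sketch of a combined group+state filter unit test.
def filter_tools(tools, requested_groups, state):
    """Keep tools that share a group with the request and are
    available in the current state (no states key = all states)."""
    requested = set(requested_groups)
    return [
        t for t in tools
        if requested & set(t["group"])
        and (t.get("states") is None or state in t["states"])
    ]

def test_group_intersection_and_state_filtering():
    tools = [
        {"name": "knowledge-query", "group": ["read-only", "knowledge"]},
        {"name": "graph-update", "group": ["write"]},
        {"name": "complex-analysis", "group": ["compute"],
         "states": ["analysis"]},
    ]
    # Read-only request in the initial state: the write tool and the
    # state-gated tool are both filtered out.
    names = [t["name"] for t in
             filter_tools(tools, ["read-only"], "undefined")]
    assert names == ["knowledge-query"]
    # The same request in the "analysis" state unlocks the compute tool.
    names = [t["name"] for t in
             filter_tools(tools, ["read-only", "compute"], "analysis")]
    assert names == ["knowledge-query", "complex-analysis"]

test_group_intersection_and_state_filtering()
```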

#### 9.2 Integration Tests

**Agent Behavior:**

- Verify agents only see group+state filtered tools
- Test request execution with various group combinations
- Test state transitions during agent execution
- Validate error handling when no tools are available
- Test workflow progression through multiple states

**Security Testing:**

- Test privilege escalation prevention
- Verify audit trail accuracy
- Test gateway integration with user permissions

#### 9.3 End-to-End Scenarios

**Multi-tenant Usage with State Workflows:**

```
Scenario: Different users with different tool access and workflow states
Given: User A has "read-only" permissions, state "undefined"
And: User B has "write" permissions, state "analysis"
When: Both request knowledge operations
Then: User A gets read-only tools available in "undefined" state
And: User B gets write tools available in "analysis" state
And: State transitions are tracked per user session
And: All usage and transitions are properly audited
```

**Workflow State Progression:**

```
Scenario: Complete workflow execution
Given: Request with groups ["knowledge", "compute"] and state "undefined"
When: Agent executes knowledge-query tool (transitions to "analysis")
And: Agent executes complex-analysis tool (transitions to "results")
And: Agent executes reset-workflow tool (transitions to "undefined")
Then: Each step has correctly filtered available tools
And: State transitions are logged with timestamps
And: Final state allows initial workflow to repeat
```
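
The workflow-progression scenario can be exercised with a small sketch: a tool table carrying state transitions, and a loop that re-filters the tool set after each step. The tool names mirror the scenario; the table layout is an assumption:

```python
# Sketch of the workflow-progression scenario; data layout is hypothetical.
TOOLS = {
    "knowledge-query": {"group": "knowledge", "in": ["undefined"], "to": "analysis"},
    "complex-analysis": {"group": "compute", "in": ["analysis"], "to": "results"},
    "reset-workflow": {"group": "compute", "in": ["results"], "to": "undefined"},
}

def available(groups, state):
    """Tools whose group is requested and which allow the current state."""
    return [n for n, t in TOOLS.items()
            if t["group"] in groups and state in t["in"]]

state, transitions = "undefined", []
for step in ["knowledge-query", "complex-analysis", "reset-workflow"]:
    tools = available({"knowledge", "compute"}, state)
    assert step in tools                       # each step sees correctly filtered tools
    transitions.append((step, state, TOOLS[step]["to"]))
    state = TOOLS[step]["to"]                  # executing the tool transitions state

assert state == "undefined"                    # final state lets the workflow repeat
```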

### 10. Performance Considerations

#### 10.1 Tool Loading Impact

**Configuration Loading:**

- Group and state metadata loaded once at startup
- Minimal memory overhead per tool (additional fields)
- No impact on tool initialization time

**Request Processing:**

- Combined group+state filtering occurs once per request
- O(n) complexity, where n is the number of configured tools
- State transitions add minimal overhead (string assignment)
- Negligible impact for typical tool counts (< 100)

#### 10.2 Optimization Strategies

**Pre-computed Tool Sets:**

- Cache tool sets by group+state combination
- Avoid repeated filtering for common group/state patterns
- Memory vs. computation tradeoff for frequently used combinations
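
One minimal way to realize this caching, assuming the request's group list is normalized to a hashable `frozenset`, is `functools.lru_cache` keyed on the (groups, state) pair; the tool table and helper name are illustrative:

```python
# Sketch of pre-computed tool sets via lru_cache; names are hypothetical.
from functools import lru_cache

TOOLS = {
    "knowledge-query": {"group": frozenset({"knowledge"}), "states": None},
    "graph-update": {"group": frozenset({"write"}), "states": None},
    "complex-analysis": {"group": frozenset({"compute"}),
                         "states": frozenset({"analysis"})},
}

@lru_cache(maxsize=256)
def tools_for(groups: frozenset, state: str) -> tuple:
    """Filter once per distinct (groups, state) combination; repeated
    requests with a common pattern are served from the cache."""
    return tuple(
        name for name, t in TOOLS.items()
        if groups & t["group"]
        and (t["states"] is None or state in t["states"])
    )

# Call sites normalize the request's group list before lookup:
available = tools_for(frozenset({"knowledge", "compute"}), "analysis")
```

`tools_for.cache_info()` exposes hit/miss counts, which can guide the `maxsize` side of the memory-versus-computation tradeoff; the cache must be invalidated (`cache_clear()`) whenever the tool configuration is reloaded.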

**Lazy Loading:**

- Load tool implementations only when needed
- Reduce startup time for deployments with many tools
- Dynamic tool registration based on group requirements

### 11. Future Enhancements

#### 11.1 Dynamic Group Assignment

**Context-Aware Grouping:**

- Assign tools to groups based on request context
- Time-based group availability (business hours only)
- Load-based group restrictions (expensive tools during low usage)
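
One way such context-aware policies could be expressed is as per-group predicates evaluated against the request context; everything here (names, thresholds, the context shape) is speculative, since this enhancement is not yet designed:

```python
# Speculative sketch of context-aware group availability policies.
from datetime import datetime

def business_hours(ctx) -> bool:
    # Available 09:00-17:00 on weekdays only
    now = ctx["now"]
    return now.weekday() < 5 and 9 <= now.hour < 17

def low_load(ctx) -> bool:
    # Expensive tools only while the cluster is quiet
    return ctx["load"] < 0.5

GROUP_POLICIES = {
    "admin": business_hours,
    "compute": low_load,
}

def active_groups(requested, ctx):
    """Drop groups whose availability policy rejects this context;
    groups without a policy are always active."""
    return [g for g in requested
            if GROUP_POLICIES.get(g, lambda _: True)(ctx)]

ctx = {"now": datetime(2024, 1, 3, 10, 0), "load": 0.8}  # a Wednesday, 10:00
# "admin" passes (business hours); "compute" is dropped (high load)
print(active_groups(["admin", "compute", "knowledge"], ctx))
```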

#### 11.2 Group Hierarchies

**Nested Group Structure:**

```json
{
  "knowledge": {
    "read": ["knowledge-query", "entity-search"],
    "write": ["graph-update", "entity-create"]
  }
}
```
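
One possible resolution strategy for such a hierarchy is to flatten it into dotted group names ("knowledge.read", "knowledge.write"), with the parent name granting every nested subgroup's tools; the dotted-name convention and helper names are assumptions:

```python
# Sketch of resolving a nested group hierarchy into flat group names.
def flatten_groups(tree, prefix=""):
    """Return a dict mapping dotted group names to tool lists."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_groups(value, name))
        else:
            flat[name] = value
    return flat

hierarchy = {
    "knowledge": {
        "read": ["knowledge-query", "entity-search"],
        "write": ["graph-update", "entity-create"],
    }
}

flat = flatten_groups(hierarchy)
# -> {"knowledge.read": [...], "knowledge.write": [...]}

def expand(group, flat):
    """A parent group name grants every nested subgroup's tools."""
    return sorted({tool
                   for name, tools in flat.items()
                   if name == group or name.startswith(group + ".")
                   for tool in tools})

assert expand("knowledge", flat) == sorted(
    ["knowledge-query", "entity-search", "graph-update", "entity-create"])
```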

#### 11.3 Tool Recommendations

**Group-Based Suggestions:**

- Suggest optimal tool groups for request types
- Learn from usage patterns to improve recommendations
- Provide fallback groups when preferred tools are unavailable

### 12. Open Questions

1. **Group Validation**: Should invalid group names in requests cause hard failures or warnings?

2. **Group Discovery**: Should the system provide an API to list available groups and their tools?

3. **Dynamic Groups**: Should groups be configurable at runtime or only at startup?

4. **Group Inheritance**: Should tools inherit groups from their parent categories or implementations?

5. **Performance Monitoring**: What additional metrics are needed to track group-based tool usage effectively?

### 13. Conclusion

The tool group system provides:

- **Security**: Fine-grained access control over agent capabilities
- **Performance**: Reduced tool loading and selection overhead
- **Flexibility**: Multi-dimensional tool classification
- **Compatibility**: Seamless integration with existing agent architectures

This system enables TrustGraph deployments to better manage tool access, improve security boundaries, and optimize resource usage while maintaining full backward compatibility with existing configurations and requests.