Release 1.4 -> master (#524)

Catch up
2026-04-26 08:56:21 +02:00 · 2025-09-20 16:00:37 +01:00
commit 6c7af8789d
parent a8e437fc7f
216 changed files with 31360 additions and 1611 deletions
--- a/docs/tech-specs/neo4j-user-collection-isolation.md
+++ b/docs/tech-specs/neo4j-user-collection-isolation.md
@@ -0,0 +1,359 @@
+# Neo4j User/Collection Isolation Support
+
+## Problem Statement
+
+The Neo4j triples storage and query implementation currently lacks user/collection isolation, which creates a multi-tenancy security issue. All triples are stored in the same graph space without any mechanism to prevent users from accessing other users' data or mixing collections.
+
+Unlike other storage backends in TrustGraph:
+- **Cassandra**: Uses separate keyspaces per user and tables per collection 
+- **Vector stores** (Milvus, Qdrant, Pinecone): Use collection-specific namespaces
+- **Neo4j**: Currently shares all data in a single graph (security vulnerability)
+
+## Current Architecture
+
+### Data Model
+- **Nodes**: `:Node` label with `uri` property, `:Literal` label with `value` property
+- **Relationships**: `:Rel` label with `uri` property
+- **Indexes**: `Node.uri`, `Literal.value`, `Rel.uri`
+
+### Message Flow
+- `Triples` messages contain `metadata.user` and `metadata.collection` fields
+- Storage service receives user/collection info but ignores it
+- Query service expects `user` and `collection` in `TriplesQueryRequest` but ignores them
+
+### Current Security Issue
+```cypher
+// Any user can query any data: there is no isolation
+MATCH (src:Node)-[rel:Rel]->(dest:Node) 
+RETURN src.uri, rel.uri, dest.uri
+```
+
+## Proposed Solution: Property-Based Filtering (Recommended)
+
+### Overview
+Add `user` and `collection` properties to all nodes and relationships, then filter every read and write by these properties. This gives logical isolation within a single graph, provided that no query omits the filter, while keeping query flexibility and a backwards-compatible migration path.
+
+### Data Model Changes
+
+#### Enhanced Node Structure
+```cypher
+// Node entities
+CREATE (n:Node {
+  uri: "http://example.com/entity1",
+  user: "john_doe", 
+  collection: "production_v1"
+})
+
+// Literal entities  
+CREATE (n:Literal {
+  value: "literal value",
+  user: "john_doe",
+  collection: "production_v1" 
+})
+```
+
+#### Enhanced Relationship Structure
+```cypher
+// Relationships with user/collection properties
+CREATE (src)-[:Rel {
+  uri: "http://example.com/predicate1",
+  user: "john_doe",
+  collection: "production_v1"
+}]->(dest)
+```
+
+#### Updated Indexes
+```cypher
+// Compound indexes for efficient filtering
+CREATE INDEX node_user_collection_uri FOR (n:Node) ON (n.user, n.collection, n.uri);
+CREATE INDEX literal_user_collection_value FOR (n:Literal) ON (n.user, n.collection, n.value);
+CREATE INDEX rel_user_collection_uri FOR ()-[r:Rel]-() ON (r.user, r.collection, r.uri);
+
+// Maintain existing indexes for backwards compatibility (optional)
+CREATE INDEX Node_uri FOR (n:Node) ON (n.uri);
+CREATE INDEX Literal_value FOR (n:Literal) ON (n.value);
+CREATE INDEX Rel_uri FOR ()-[r:Rel]-() ON (r.uri);
+```
+
+### Implementation Changes
+
+#### Storage Service (`write.py`)
+
+**Current Code:**
+```python
+def create_node(self, uri):
+    summary = self.io.execute_query(
+        "MERGE (n:Node {uri: $uri})",
+        uri=uri, database_=self.db,
+    ).summary
+```
+
+**Updated Code:**
+```python
+def create_node(self, uri, user, collection):
+    summary = self.io.execute_query(
+        "MERGE (n:Node {uri: $uri, user: $user, collection: $collection})",
+        uri=uri, user=user, collection=collection, database_=self.db,
+    ).summary
+```
+
+**Enhanced store_triples Method:**
+```python
+async def store_triples(self, message):
+    user = message.metadata.user
+    collection = message.metadata.collection
+    
+    for t in message.triples:
+        self.create_node(t.s.value, user, collection)
+        
+        if t.o.is_uri:
+            self.create_node(t.o.value, user, collection)  
+            self.relate_node(t.s.value, t.p.value, t.o.value, user, collection)
+        else:
+            self.create_literal(t.o.value, user, collection)
+            self.relate_literal(t.s.value, t.p.value, t.o.value, user, collection)
+```
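The `relate_node` call above is not shown in this spec. As a rough sketch of the Cypher it could issue, the hypothetical helper `build_relate_node_query` below (its name and shape are assumptions, not taken from `write.py`) matches both endpoints inside the same user/collection before merging the relationship. It only builds the query text and parameters; handing them to `execute_query` is left out.

```python
def build_relate_node_query(src, rel, dest, user, collection):
    """Return (cypher, params) for linking two existing :Node entries.

    Hypothetical helper, not part of the original write.py. Both endpoints
    are matched inside the same user/collection, so a relationship cannot
    connect nodes that belong to different tenants.
    """
    cypher = (
        "MATCH (src:Node {uri: $src, user: $user, collection: $collection}) "
        "MATCH (dest:Node {uri: $dest, user: $user, collection: $collection}) "
        "MERGE (src)-[:Rel {uri: $rel, user: $user, collection: $collection}]->(dest)"
    )
    params = {
        "src": src, "rel": rel, "dest": dest,
        "user": user, "collection": collection,
    }
    return cypher, params


cypher, params = build_relate_node_query(
    "http://example.com/entity1",
    "http://example.com/predicate1",
    "http://example.com/entity2",
    "john_doe",
    "production_v1",
)
print(cypher)
print(params)
```

Scoping the two `MATCH` clauses matters as much as scoping the relationship: if the endpoints were matched by `uri` alone, the `MERGE` could attach one tenant's relationship to another tenant's node.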
+
+#### Query Service (`service.py`) 
+
+**Current Code:**
+```python
+records, summary, keys = self.io.execute_query(
+    "MATCH (src:Node {uri: $src})-[rel:Rel {uri: $rel}]->(dest:Node) "
+    "RETURN dest.uri as dest",
+    src=query.s.value, rel=query.p.value, database_=self.db,
+)
+```
+
+**Updated Code:**
+```python
+records, summary, keys = self.io.execute_query(
+    "MATCH (src:Node {uri: $src, user: $user, collection: $collection})-"
+    "[rel:Rel {uri: $rel, user: $user, collection: $collection}]->"
+    "(dest:Node {user: $user, collection: $collection}) "
+    "RETURN dest.uri as dest",
+    src=query.s.value, rel=query.p.value, 
+    user=query.user, collection=query.collection,
+    database_=self.db,
+)
+```
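The updated query above covers one pattern (subject and predicate bound). A minimal sketch of how the same scoping might be applied to any combination of bound terms follows. `build_triples_query` and its `dest_label` argument are hypothetical names introduced for illustration; the function only produces query text and parameters.

```python
def build_triples_query(user, collection, s=None, p=None, o=None, dest_label="Node"):
    """Return (cypher, params) for a triple pattern scoped to one tenant.

    Hypothetical helper. Unbound terms (None) are left out of the pattern,
    but the user/collection filter is applied to every pattern element.
    """
    # The label is interpolated into the query text, so restrict it to known values.
    if dest_label not in ("Node", "Literal"):
        raise ValueError("dest_label must be 'Node' or 'Literal'")

    scope = "user: $user, collection: $collection"
    dest_key = "uri" if dest_label == "Node" else "value"
    params = {"user": user, "collection": collection}

    def props(key, name, value):
        # Add the property match only when the term is bound.
        if value is None:
            return "{" + scope + "}"
        params[name] = value
        return "{" + f"{key}: ${name}, {scope}" + "}"

    cypher = (
        f"MATCH (src:Node {props('uri', 'src', s)})-"
        f"[rel:Rel {props('uri', 'rel', p)}]->"
        f"(dest:{dest_label} {props(dest_key, 'dest', o)}) "
        f"RETURN src.uri AS src, rel.uri AS rel, dest.{dest_key} AS dest"
    )
    return cypher, params


cypher, params = build_triples_query(
    "john_doe", "production_v1",
    s="http://example.com/entity1", p="http://example.com/predicate1",
)
print(cypher)
```

Generating every pattern from one function keeps the filter in a single place, which makes it harder for a new query shape to omit it.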
+
+### Migration Strategy
+
+#### Phase 1: Add Properties to New Data
+1. Update storage service to add user/collection properties to new triples
+2. Maintain backwards compatibility by not requiring properties in queries
+3. Existing data remains accessible but not isolated
+
+#### Phase 2: Migrate Existing Data  
+```cypher
+// Migrate existing nodes (requires default user/collection assignment)
+MATCH (n:Node) WHERE n.user IS NULL
+SET n.user = 'legacy_user', n.collection = 'default_collection';
+
+MATCH (n:Literal) WHERE n.user IS NULL  
+SET n.user = 'legacy_user', n.collection = 'default_collection';
+
+MATCH ()-[r:Rel]->() WHERE r.user IS NULL
+SET r.user = 'legacy_user', r.collection = 'default_collection';
+```
+
+#### Phase 3: Enforce Isolation
+1. Update query service to require user/collection filtering
+2. Add validation to reject queries without proper user/collection context
+3. Remove legacy data access paths
+
+### Security Considerations
+
+#### Query Validation
+```python
+async def query_triples(self, query):
+    # Validate user/collection parameters
+    if not query.user or not query.collection:
+        raise ValueError("User and collection must be specified")
+    
+    # All queries must include user/collection filters
+    # ... rest of implementation
+```
+
+#### Preventing Parameter Injection
+- Use parameterized queries exclusively
+- Validate user/collection values against allowed patterns
+- Pass user/collection only as query parameters (property values); never interpolate them into Cypher text, labels, or property names
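The pattern check in the list above could look roughly like the following. The allowed character set and the 64-character limit are assumptions chosen for illustration, not a TrustGraph rule, and `validate_scope` is a hypothetical name.

```python
import re

# Assumed policy: 1 to 64 characters from letters, digits, underscore, hyphen.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_scope(user, collection):
    """Return (user, collection) unchanged, or raise ValueError if either is malformed."""
    for name, value in (("user", user), ("collection", collection)):
        if not isinstance(value, str) or not _ALLOWED.fullmatch(value):
            raise ValueError(f"invalid {name}: {value!r}")
    return user, collection


print(validate_scope("john_doe", "production_v1"))
```

Validation of this kind complements parameterized queries rather than replacing them: parameters prevent injection, while the pattern check rejects empty or malformed identifiers before they reach the database.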
+
+#### Audit Trail
+```python
+logger.info(f"Query executed - User: {query.user}, Collection: {query.collection}, "
+           f"Pattern: {query.s}/{query.p}/{query.o}")
+```
+
+## Alternative Approaches Considered
+
+### Option 2: Label-Based Isolation
+
+**Approach**: Use dynamic labels like `User_john_Collection_prod`
+
+**Pros:**
+- Strong isolation through label filtering
+- Efficient query performance with label indexes
+- Clear data separation
+
+**Cons:**
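+- One label per user/collection pair makes the label count grow with every tenant, and Neo4j has practical limits on the number of labels (on the order of thousands)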
+- Complex label name generation and sanitization
+- Difficult to query across collections when needed
+
+**Implementation Example:**
+```cypher
+CREATE (n:Node:User_john_Collection_prod {uri: "http://example.com/entity"});
+MATCH (n:User_john_Collection_prod) WHERE n:Node RETURN n;
+```
+
+### Option 3: Database-Per-User
+
+**Approach**: Create separate Neo4j databases for each user or user/collection combination
+
+**Pros:**
+- Complete data isolation
+- No risk of cross-contamination
+- Independent scaling per user
+
+**Cons:**
+- Resource overhead (each database consumes memory)
+- Complex database lifecycle management
+- Neo4j Community Edition supports only a single user database; multiple databases require Enterprise Edition
+- Difficult cross-user analytics
+
+### Option 4: Composite Key Strategy  
+
+**Approach**: Prefix all URIs and values with user/collection information
+
+**Pros:**
+- Backwards compatible with existing queries
+- Simple implementation
+- No schema changes required
+
+**Cons:**
+- URI pollution affects data semantics
+- Less efficient queries (string prefix matching)
+- Breaks RDF/semantic web standards
+
+**Implementation Example:**
+```python
+def make_composite_uri(uri, user, collection):
+    return f"usr:{user}:col:{collection}:uri:{uri}"
+```
+
+## Implementation Plan
+
+### Phase 1: Foundation (Week 1)
+1. [ ] Update storage service to accept and store user/collection properties
+2. [ ] Add compound indexes for efficient querying
+3. [ ] Implement backwards compatibility layer
+4. [ ] Create unit tests for new functionality
+
+### Phase 2: Query Updates (Week 2)  
+1. [ ] Update all query patterns to include user/collection filters
+2. [ ] Add query validation and security checks
+3. [ ] Update integration tests
+4. [ ] Performance testing with filtered queries
+
+### Phase 3: Migration & Deployment (Week 3)
+1. [ ] Create data migration scripts for existing Neo4j instances
+2. [ ] Deployment documentation and runbooks
+3. [ ] Monitoring and alerting for isolation violations
+4. [ ] End-to-end testing with multiple users/collections
+
+### Phase 4: Hardening (Week 4)
+1. [ ] Remove legacy compatibility mode
+2. [ ] Add comprehensive audit logging
+3. [ ] Security review and penetration testing
+4. [ ] Performance optimization
+
+## Testing Strategy
+
+### Unit Tests
+```python
+import pytest  # the async test below assumes the pytest-asyncio plugin
+@pytest.mark.asyncio
+async def test_user_collection_isolation():
+    # Store triples for user1/collection1 and user2/collection2
+    await processor.store_triples(triples_user1_coll1)
+    await processor.store_triples(triples_user2_coll2)
+    
+    # Query as user1 should only return user1's data
+    results = await processor.query_triples(query_user1_coll1)
+    assert all_results_belong_to_user1_coll1(results)
+    
+    # Query as user2 should only return user2's data
+    results = await processor.query_triples(query_user2_coll2)
+    assert all_results_belong_to_user2_coll2(results)
+```
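The test above depends on a `processor` and fixtures that are defined elsewhere. To show the same isolation property in a form that runs without Neo4j, here is a toy sketch using an in-memory stand-in. `InMemoryTripleStore` is a hypothetical class with a simplified signature, and it says nothing about how the real processor is built.

```python
from collections import defaultdict


class InMemoryTripleStore:
    """Toy stand-in for the Neo4j processor, keyed by (user, collection)."""

    def __init__(self):
        self._data = defaultdict(set)

    def store_triples(self, user, collection, triples):
        self._data[(user, collection)].update(triples)

    def query_triples(self, user, collection, s=None, p=None, o=None):
        if not user or not collection:
            raise ValueError("User and collection must be specified")
        # Only the bucket for this exact (user, collection) pair is ever read.
        return sorted(
            t for t in self._data.get((user, collection), set())
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def test_user_collection_isolation():
    store = InMemoryTripleStore()
    store.store_triples("user1", "coll1", [("a", "knows", "b")])
    store.store_triples("user2", "coll2", [("a", "knows", "c")])

    assert store.query_triples("user1", "coll1", s="a") == [("a", "knows", "b")]
    assert store.query_triples("user2", "coll2", s="a") == [("a", "knows", "c")]
    # An unknown scope sees nothing rather than another tenant's data.
    assert store.query_triples("user1", "coll2") == []


test_user_collection_isolation()
```

Both tenants deliberately store a triple with the same subject and predicate, which is the overlapping-data case where a missing filter would be most visible.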
+
+### Integration Tests
+- Multi-user scenarios with overlapping data
+- Cross-collection queries (should fail)  
+- Migration testing with existing data
+- Performance benchmarks with large datasets
+
+### Security Tests
+- Attempt to query other users' data
+- Cypher injection attacks on user/collection parameters (the graph analogue of SQL injection)
+- Verify complete isolation under various query patterns
+
+## Performance Considerations
+
+### Index Strategy
+- Compound indexes on `(user, collection, uri)` for optimal filtering
+- Neo4j does not offer partial indexes, so if some collections are much larger, watch index size and selectivity for those collections instead
+- Monitor index usage and query performance
+
+### Query Optimization
+- Use EXPLAIN to verify index usage in filtered queries
+- Consider query result caching for frequently accessed data
+- Profile memory usage with large numbers of users/collections
+
+### Scalability
+- Each user/collection combination creates separate data islands
+- Monitor database size and connection pool usage
+- Consider horizontal scaling strategies if needed
+
+## Security & Compliance
+
+### Data Isolation Guarantees
+- **Storage**: Every node and relationship carries explicit `user` and `collection` properties (tenants share one graph, so this is tagging rather than physical separation)
+- **Logical**: All queries filtered by user/collection context
+- **Access Control**: Service-level validation prevents unauthorized access
+
+### Audit Requirements
+- Log all data access with user/collection context
+- Track migration activities and data movements
+- Monitor for isolation violation attempts
+
+### Compliance Considerations
+- GDPR: Enhanced ability to locate and delete user-specific data
+- SOC2: Clear data isolation and access controls
+- HIPAA: Strong tenant isolation for healthcare data
+
+## Risks & Mitigations
+
+| Risk | Impact | Likelihood | Mitigation |
+|------|--------|------------|------------|
+| Query missing user/collection filter | High | Medium | Mandatory validation, comprehensive testing |
+| Performance degradation | Medium | Low | Index optimization, query profiling |
+| Migration data corruption | High | Low | Backup strategy, rollback procedures |
+| Complex multi-collection queries | Medium | Medium | Document query patterns, provide examples |
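For the first risk in the table, the "mandatory validation" mitigation might be approximated by a guard that runs before every query is sent. The sketch below is a textual heuristic under stated assumptions: `require_scope_filter` is a hypothetical name, and checking for `$user` and `$collection` in the query text cannot prove that every pattern element is filtered.

```python
import re

REQUIRED_PARAMS = ("user", "collection")


def require_scope_filter(cypher, params):
    """Raise ValueError unless the query references and supplies $user and $collection.

    Heuristic only: it catches queries that forget the scope entirely,
    not queries that filter some pattern elements and miss others.
    """
    for name in REQUIRED_PARAMS:
        if not re.search(r"\$" + name + r"\b", cypher):
            raise ValueError(f"query does not reference ${name}")
        if not params.get(name):
            raise ValueError(f"missing value for ${name}")
    return cypher, params


scoped = "MATCH (n:Node {user: $user, collection: $collection}) RETURN n.uri"
require_scope_filter(scoped, {"user": "john_doe", "collection": "production_v1"})

try:
    require_scope_filter("MATCH (n:Node) RETURN n.uri", {})
except ValueError as exc:
    print(f"rejected: {exc}")
```

A guard like this is a backstop for code review and tests, not a substitute for them.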
+
+## Success Criteria
+
+1. **Security**: Zero cross-user data access in production
+2. **Performance**: <10% query performance impact vs unfiltered queries  
+3. **Migration**: 100% existing data successfully migrated with zero loss
+4. **Usability**: All existing query patterns work with user/collection context
+5. **Compliance**: Full audit trail of user/collection data access
+
+## Conclusion
+
+The property-based filtering approach provides the best balance of security, performance, and maintainability for adding user/collection isolation to Neo4j. It aligns with TrustGraph's existing multi-tenancy patterns while leveraging Neo4j's strengths in graph querying and indexing.
+
+This solution brings TrustGraph's Neo4j backend up to the same isolation standard as the other storage backends, closing the cross-tenant data exposure while keeping the flexibility and power of graph queries.