Neo4j User/Collection Isolation Support

Problem Statement

The Neo4j triples storage and query implementation currently lacks user/collection isolation, which creates a multi-tenancy security issue. All triples are stored in the same graph space without any mechanism to prevent users from accessing other users' data or mixing collections.

Unlike other storage backends in TrustGraph:

  • Cassandra: Uses separate keyspaces per user and tables per collection
  • Vector stores (Milvus, Qdrant, Pinecone): Use collection-specific namespaces
  • Neo4j: Currently shares all data in a single graph (security vulnerability)

Current Architecture

Data Model

  • Nodes: :Node label with uri property, :Literal label with value property
  • Relationships: :Rel relationship type with uri property
  • Indexes: Node.uri, Literal.value, Rel.uri

Message Flow

  • Triples messages contain metadata.user and metadata.collection fields
  • Storage service receives user/collection info but ignores it
  • Query service expects user and collection in TriplesQueryRequest but ignores them

Current Security Issue

// Any user can query any data - no isolation
MATCH (src:Node)-[rel:Rel]->(dest:Node) 
RETURN src.uri, rel.uri, dest.uri

Overview

The proposed approach (Option 1, property-based filtering) adds user and collection properties to all nodes and relationships, then filters every operation by these properties. This isolates tenants at the query level while maintaining query flexibility and backwards compatibility.
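As a minimal sketch of the idea, the helper below merges the tenant context into the parameter map passed with every query. The name `scope_params` is hypothetical and not part of the TrustGraph codebase.

```python
def scope_params(user, collection, **params):
    """Merge tenant context into a query parameter map.

    Hypothetical helper for illustration; not part of the TrustGraph codebase.
    """
    if not user or not collection:
        raise ValueError("User and collection must be specified")
    return {**params, "user": user, "collection": collection}


# The same URI written by two users produces two different parameter sets,
# so a MERGE keyed on (uri, user, collection) creates two separate nodes.
john = scope_params("john_doe", "production_v1", uri="http://example.com/entity1")
jane = scope_params("jane_doe", "production_v1", uri="http://example.com/entity1")
```

Routing every read and write through one helper like this would make a missing filter easier to spot in review, since any query that bypasses it stands out.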

Data Model Changes

Enhanced Node Structure

// Node entities
CREATE (n:Node {
  uri: "http://example.com/entity1",
  user: "john_doe", 
  collection: "production_v1"
})

// Literal entities  
CREATE (n:Literal {
  value: "literal value",
  user: "john_doe",
  collection: "production_v1" 
})

Enhanced Relationship Structure

// Relationships with user/collection properties
CREATE (src)-[:Rel {
  uri: "http://example.com/predicate1",
  user: "john_doe",
  collection: "production_v1"
}]->(dest)

Updated Indexes

// Compound indexes for efficient filtering
CREATE INDEX node_user_collection_uri FOR (n:Node) ON (n.user, n.collection, n.uri);
CREATE INDEX literal_user_collection_value FOR (n:Literal) ON (n.user, n.collection, n.value);
CREATE INDEX rel_user_collection_uri FOR ()-[r:Rel]-() ON (r.user, r.collection, r.uri);

// Maintain existing indexes for backwards compatibility (optional)
CREATE INDEX Node_uri FOR (n:Node) ON (n.uri);
CREATE INDEX Literal_value FOR (n:Literal) ON (n.value);
CREATE INDEX Rel_uri FOR ()-[r:Rel]-() ON (r.uri);
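The compound index statements could also be generated and applied at service start-up. The sketch below assumes a Neo4j version that accepts `IF NOT EXISTS` on `CREATE INDEX`, and an `io` object exposing `execute_query` in the same way as the snippets in this document. `INDEX_SPECS`, `index_statements` and `create_indexes` are illustrative names.

```python
# Hypothetical index definitions: (index name, pattern, variable, key property).
INDEX_SPECS = [
    ("node_user_collection_uri", "(n:Node)", "n", "uri"),
    ("literal_user_collection_value", "(n:Literal)", "n", "value"),
    ("rel_user_collection_uri", "()-[r:Rel]-()", "r", "uri"),
]


def index_statements():
    """Build idempotent CREATE INDEX statements for the compound indexes."""
    statements = []
    for name, pattern, var, key in INDEX_SPECS:
        columns = ", ".join(f"{var}.{prop}" for prop in ("user", "collection", key))
        statements.append(
            f"CREATE INDEX {name} IF NOT EXISTS FOR {pattern} ON ({columns})"
        )
    return statements


def create_indexes(io, database):
    """Run each statement; `io` is assumed to expose execute_query."""
    for statement in index_statements():
        io.execute_query(statement, database_=database)
```

Because each statement is idempotent, running `create_indexes` on every start-up should be safe on both fresh and already-migrated databases.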

Implementation Changes

Storage Service (write.py)

Current Code:

def create_node(self, uri):
    summary = self.io.execute_query(
        "MERGE (n:Node {uri: $uri})",
        uri=uri, database_=self.db,
    ).summary

Updated Code:

def create_node(self, uri, user, collection):
    summary = self.io.execute_query(
        "MERGE (n:Node {uri: $uri, user: $user, collection: $collection})",
        uri=uri, user=user, collection=collection, database_=self.db,
    ).summary

Enhanced store_triples Method:

async def store_triples(self, message):
    user = message.metadata.user
    collection = message.metadata.collection
    
    for t in message.triples:
        self.create_node(t.s.value, user, collection)
        
        if t.o.is_uri:
            self.create_node(t.o.value, user, collection)  
            self.relate_node(t.s.value, t.p.value, t.o.value, user, collection)
        else:
            self.create_literal(t.o.value, user, collection)
            self.relate_literal(t.s.value, t.p.value, t.o.value, user, collection)
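The `store_triples` method above calls `create_literal`, `relate_node` and `relate_literal`, which this document does not show. The sketch below is one plausible shape for them, not the actual TrustGraph implementation. It assumes they follow the same pattern as `create_node`; the class name `ScopedWriter` and the Cypher text are assumptions.

```python
class ScopedWriter:
    """Sketch of tenant-scoped write helpers; `io` is assumed to expose execute_query."""

    def __init__(self, io, db):
        self.io = io
        self.db = db

    def create_literal(self, value, user, collection):
        self.io.execute_query(
            "MERGE (n:Literal {value: $value, user: $user, collection: $collection})",
            value=value, user=user, collection=collection, database_=self.db,
        )

    def relate_node(self, src, uri, dest, user, collection):
        # Both endpoints and the relationship carry the same tenant context.
        self.io.execute_query(
            "MATCH (src:Node {uri: $src, user: $user, collection: $collection}) "
            "MATCH (dest:Node {uri: $dest, user: $user, collection: $collection}) "
            "MERGE (src)-[:Rel {uri: $uri, user: $user, collection: $collection}]->(dest)",
            src=src, dest=dest, uri=uri, user=user, collection=collection,
            database_=self.db,
        )

    def relate_literal(self, src, uri, dest, user, collection):
        self.io.execute_query(
            "MATCH (src:Node {uri: $src, user: $user, collection: $collection}) "
            "MATCH (dest:Literal {value: $dest, user: $user, collection: $collection}) "
            "MERGE (src)-[:Rel {uri: $uri, user: $user, collection: $collection}]->(dest)",
            src=src, dest=dest, uri=uri, user=user, collection=collection,
            database_=self.db,
        )
```

Matching both endpoints with the tenant properties before the `MERGE` means a relationship can only ever connect nodes belonging to the same user and collection.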

Query Service (service.py)

Current Code:

records, summary, keys = self.io.execute_query(
    "MATCH (src:Node {uri: $src})-[rel:Rel {uri: $rel}]->(dest:Node) "
    "RETURN dest.uri as dest",
    src=query.s.value, rel=query.p.value, database_=self.db,
)

Updated Code:

records, summary, keys = self.io.execute_query(
    "MATCH (src:Node {uri: $src, user: $user, collection: $collection})-"
    "[rel:Rel {uri: $rel, user: $user, collection: $collection}]->"
    "(dest:Node {user: $user, collection: $collection}) "
    "RETURN dest.uri as dest",
    src=query.s.value, rel=query.p.value, 
    user=query.user, collection=query.collection,
    database_=self.db,
)
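The example above covers one pattern (subject and predicate bound). A query service usually handles several combinations of bound and unbound terms, so the filters could be assembled in one place. The following is a sketch under that assumption. The function name `build_scoped_match` is hypothetical, and it covers only Node-to-Node patterns for brevity.

```python
def build_scoped_match(user, collection, s=None, p=None, o=None):
    """Build a Cypher query and parameters for a Node-to-Node triple pattern.

    Any of s, p, o may be None (unbound); user and collection always apply.
    """
    if not user or not collection:
        raise ValueError("User and collection must be specified")

    scope = "user: $user, collection: $collection"
    params = {"user": user, "collection": collection}

    def props(key, name, value):
        # Add the key constraint only when the term is bound.
        if value is None:
            return "{" + scope + "}"
        params[name] = value
        return "{" + f"{key}: ${name}, " + scope + "}"

    query = (
        f"MATCH (src:Node {props('uri', 'src', s)})-"
        f"[rel:Rel {props('uri', 'rel', p)}]->"
        f"(dest:Node {props('uri', 'dest', o)}) "
        "RETURN src.uri AS src, rel.uri AS rel, dest.uri AS dest"
    )
    return query, params
```

The result could be passed on as `self.io.execute_query(query, parameters_=params, database_=self.db)`. Only the fixed property keys are interpolated into the query text; every caller-supplied value travels as a parameter.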

Migration Strategy

Phase 1: Add Properties to New Data

  1. Update storage service to add user/collection properties to new triples
  2. Maintain backwards compatibility by not requiring properties in queries
  3. Existing data remains accessible but not isolated

Phase 2: Migrate Existing Data

// Migrate existing nodes (requires default user/collection assignment)
MATCH (n:Node) WHERE n.user IS NULL
SET n.user = 'legacy_user', n.collection = 'default_collection';

MATCH (n:Literal) WHERE n.user IS NULL  
SET n.user = 'legacy_user', n.collection = 'default_collection';

MATCH ()-[r:Rel]->() WHERE r.user IS NULL
SET r.user = 'legacy_user', r.collection = 'default_collection';
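On a large graph, each statement above updates every matching element in a single transaction. A batched variant is sketched below. It assumes a callable `run(query, params)` that executes the query and returns the integer in the `updated` column. `BACKFILL_QUERIES`, `backfill` and the default batch size are illustrative, not taken from the TrustGraph codebase.

```python
# Each query tags at most $batch untagged elements and reports how many it touched.
BACKFILL_QUERIES = [
    "MATCH (n:Node) WHERE n.user IS NULL WITH n LIMIT $batch "
    "SET n.user = $user, n.collection = $collection RETURN count(n) AS updated",
    "MATCH (n:Literal) WHERE n.user IS NULL WITH n LIMIT $batch "
    "SET n.user = $user, n.collection = $collection RETURN count(n) AS updated",
    "MATCH ()-[r:Rel]->() WHERE r.user IS NULL WITH r LIMIT $batch "
    "SET r.user = $user, r.collection = $collection RETURN count(r) AS updated",
]


def backfill(run, user="legacy_user", collection="default_collection", batch=10000):
    """Apply each backfill query repeatedly until a batch comes back short.

    `run(query, params)` is an assumed callable returning the `updated` count.
    """
    total = 0
    params = {"user": user, "collection": collection, "batch": batch}
    for query in BACKFILL_QUERIES:
        while True:
            updated = run(query, params)
            total += updated
            if updated < batch:
                break
    return total
```

Because each query selects only elements where `user IS NULL`, the loop can be interrupted and restarted without re-tagging data that was already migrated.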

Phase 3: Enforce Isolation

  1. Update query service to require user/collection filtering
  2. Add validation to reject queries without proper user/collection context
  3. Remove legacy data access paths

Security Considerations

Query Validation

async def query_triples(self, query):
    # Validate user/collection parameters
    if not query.user or not query.collection:
        raise ValueError("User and collection must be specified")
    
    # All queries must include user/collection filters
    # ... rest of implementation

Preventing Parameter Injection

  • Use parameterized queries exclusively
  • Validate user/collection values against allowed patterns
  • Sanitize values if they are ever used in identifiers such as labels or property names; in this design they are passed only as property values
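The allowed-pattern check could look like the sketch below. The specific policy (letters, digits, underscore and hyphen, 1 to 64 characters) is an assumption chosen for illustration and is not specified by this document. `validate_scope` is a hypothetical name.

```python
import re

# Assumed policy: letters, digits, underscore, hyphen; 1 to 64 characters.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_scope(user, collection):
    """Return (user, collection) unchanged, or raise ValueError if either is malformed."""
    for name, value in (("user", user), ("collection", collection)):
        if not isinstance(value, str) or not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"Invalid {name}: {value!r}")
    return user, collection
```

Parameterized queries already prevent injected text from being executed as Cypher; a check like this mainly keeps malformed or unexpected tenant identifiers out of stored data and logs.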

Audit Trail

logger.info(f"Query executed - User: {query.user}, Collection: {query.collection}, "
           f"Pattern: {query.s}/{query.p}/{query.o}")

Alternative Approaches Considered

Option 2: Label-Based Isolation

Approach: Use dynamic labels like User_john_Collection_prod

Pros:

  • Strong isolation through label filtering
  • Efficient query performance with label indexes
  • Clear data separation

Cons:

  • Neo4j has practical limits on the number of labels (on the order of thousands)
  • Complex label name generation and sanitization
  • Difficult to query across collections when needed

Implementation Example:

CREATE (n:Node:User_john_Collection_prod {uri: "http://example.com/entity"})
MATCH (n:User_john_Collection_prod) WHERE n:Node RETURN n
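To illustrate the sanitization cost listed above, here is a sketch of label generation. In this sketch the label is interpolated into the query text rather than passed as a parameter value, so characters outside a safe set have to be replaced. `make_isolation_label` is a hypothetical name.

```python
import re


def make_isolation_label(user, collection):
    """Derive a label such as User_john_Collection_prod from raw identifiers."""
    def clean(value):
        # Replace anything outside [A-Za-z0-9_] so the label needs no quoting.
        return re.sub(r"[^A-Za-z0-9_]", "_", value)

    return f"User_{clean(user)}_Collection_{clean(collection)}"


label_a = make_isolation_label("john.doe", "prod")
label_b = make_isolation_label("john_doe", "prod")
```

Here `label_a` and `label_b` are both `User_john_doe_Collection_prod`: two distinct users collapse onto one label. A real implementation would need an injective encoding, which is part of the complexity noted under Cons.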

Option 3: Database-Per-User

Approach: Create separate Neo4j databases for each user or user/collection combination

Pros:

  • Complete data isolation
  • No risk of cross-contamination
  • Independent scaling per user

Cons:

  • Resource overhead (each database consumes memory)
  • Complex database lifecycle management
  • Neo4j Community Edition supports only a single user database
  • Difficult cross-user analytics

Option 4: Composite Key Strategy

Approach: Prefix all URIs and values with user/collection information

Pros:

  • Backwards compatible with existing queries
  • Simple implementation
  • No schema changes required

Cons:

  • URI pollution affects data semantics
  • Less efficient queries (string prefix matching)
  • Breaks RDF/semantic web standards

Implementation Example:

def make_composite_uri(uri, user, collection):
    return f"usr:{user}:col:{collection}:uri:{uri}"

Implementation Plan

Phase 1: Foundation (Week 1)

  1. Update storage service to accept and store user/collection properties
  2. Add compound indexes for efficient querying
  3. Implement backwards compatibility layer
  4. Create unit tests for new functionality

Phase 2: Query Updates (Week 2)

  1. Update all query patterns to include user/collection filters
  2. Add query validation and security checks
  3. Update integration tests
  4. Performance testing with filtered queries

Phase 3: Migration & Deployment (Week 3)

  1. Create data migration scripts for existing Neo4j instances
  2. Deployment documentation and runbooks
  3. Monitoring and alerting for isolation violations
  4. End-to-end testing with multiple users/collections

Phase 4: Hardening (Week 4)

  1. Remove legacy compatibility mode
  2. Add comprehensive audit logging
  3. Security review and penetration testing
  4. Performance optimization

Testing Strategy

Unit Tests

import pytest

@pytest.mark.asyncio  # assumes the pytest-asyncio plugin
async def test_user_collection_isolation():
    # Store triples for user1/collection1
    await processor.store_triples(triples_user1_coll1)

    # Store triples for user2/collection2
    await processor.store_triples(triples_user2_coll2)

    # Query as user1 should only return user1's data
    results = await processor.query_triples(query_user1_coll1)
    assert all_results_belong_to_user1_coll1(results)

    # Query as user2 should only return user2's data
    results = await processor.query_triples(query_user2_coll2)
    assert all_results_belong_to_user2_coll2(results)
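The test above depends on fixtures that are not shown. To pin down the expected semantics without a running database, an in-memory stand-in can model the same contract. The class below is a hypothetical test double, not the Neo4j-backed processor; its method signatures are simplified for illustration.

```python
import asyncio


class InMemoryTripleStore:
    """Hypothetical stand-in for the processor, keyed by (user, collection)."""

    def __init__(self):
        self._triples = {}  # (user, collection) -> set of (s, p, o)

    async def store_triples(self, user, collection, triples):
        self._triples.setdefault((user, collection), set()).update(triples)

    async def query_triples(self, user, collection, s=None, p=None, o=None):
        if not user or not collection:
            raise ValueError("User and collection must be specified")
        scoped = self._triples.get((user, collection), set())
        return {
            t for t in scoped
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        }


async def check_isolation():
    store = InMemoryTripleStore()
    # Both tenants use the same subject URI on purpose.
    await store.store_triples("user1", "coll1", {("s1", "p", "o1")})
    await store.store_triples("user2", "coll2", {("s1", "p", "o2")})
    r1 = await store.query_triples("user1", "coll1", s="s1")
    r2 = await store.query_triples("user2", "coll2", s="s1")
    return r1, r2


results_user1, results_user2 = asyncio.run(check_isolation())
```

Each result set contains only the triple stored under its own tenant, even though the subject is identical. The Neo4j implementation would be expected to pass the same scenario.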

Integration Tests

  • Multi-user scenarios with overlapping data
  • Cross-collection queries (should fail)
  • Migration testing with existing data
  • Performance benchmarks with large datasets

Security Tests

  • Attempt to query other users' data
  • Cypher injection-style attacks on user/collection parameters
  • Verify complete isolation under various query patterns

Performance Considerations

Index Strategy

  • Compound indexes on (user, collection, uri) for optimal filtering
  • Consider partial indexes if some collections are much larger
  • Monitor index usage and query performance

Query Optimization

  • Use EXPLAIN to verify index usage in filtered queries
  • Consider query result caching for frequently accessed data
  • Profile memory usage with large numbers of users/collections
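The EXPLAIN check could be automated along the lines sketched below. It assumes the plan is available as a nested dictionary with `operatorType` and `children` keys, and that index-backed operators contain `IndexSeek` in their name. Both assumptions should be confirmed against the driver and server versions in use. The function names are hypothetical.

```python
def plan_operators(plan):
    """Yield operatorType values from a nested plan dictionary, depth first."""
    yield plan.get("operatorType", "")
    for child in plan.get("children", []):
        yield from plan_operators(child)


def uses_index_seek(plan):
    """Return True if any operator in the plan looks like an index seek."""
    return any("IndexSeek" in op for op in plan_operators(plan))


# Hand-written sample plans for illustration; not real server output.
indexed_plan = {
    "operatorType": "ProduceResults",
    "children": [
        {"operatorType": "Expand(All)", "children": [
            {"operatorType": "NodeIndexSeek", "children": []},
        ]},
    ],
}
scan_plan = {
    "operatorType": "ProduceResults",
    "children": [{"operatorType": "NodeByLabelScan", "children": []}],
}
```

A check like `uses_index_seek` could run in performance tests against the plan of each filtered query, flagging any pattern that falls back to a label scan.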

Scalability

  • Each user/collection combination creates separate data islands
  • Monitor database size and connection pool usage
  • Consider horizontal scaling strategies if needed

Security & Compliance

Data Isolation Guarantees

  • Physical: All user data stored with explicit user/collection properties
  • Logical: All queries filtered by user/collection context
  • Access Control: Service-level validation prevents unauthorized access

Audit Requirements

  • Log all data access with user/collection context
  • Track migration activities and data movements
  • Monitor for isolation violation attempts

Compliance Considerations

  • GDPR: Enhanced ability to locate and delete user-specific data
  • SOC2: Clear data isolation and access controls
  • HIPAA: Strong tenant isolation for healthcare data

Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Query missing user/collection filter | High | Medium | Mandatory validation, comprehensive testing |
| Performance degradation | Medium | Low | Index optimization, query profiling |
| Migration data corruption | High | Low | Backup strategy, rollback procedures |
| Complex multi-collection queries | Medium | Medium | Document query patterns, provide examples |

Success Criteria

  1. Security: Zero cross-user data access in production
  2. Performance: <10% query performance impact vs unfiltered queries
  3. Migration: 100% existing data successfully migrated with zero loss
  4. Usability: All existing query patterns work with user/collection context
  5. Compliance: Full audit trail of user/collection data access

Conclusion

The property-based filtering approach provides the best balance of security, performance, and maintainability for adding user/collection isolation to Neo4j. It aligns with TrustGraph's existing multi-tenancy patterns while leveraging Neo4j's strengths in graph querying and indexing.

This solution ensures TrustGraph's Neo4j backend meets the same security standards as other storage backends, preventing data isolation vulnerabilities while maintaining the flexibility and power of graph queries.