# Technical Specification: Multi-Tenant Support

## Overview

Enable multi-tenant deployments by fixing parameter name mismatches that prevent queue customization and adding Cassandra keyspace parameterization.

## Architecture Context

### Flow-Based Queue Resolution

The TrustGraph system uses a **flow-based architecture** for dynamic queue resolution, which inherently supports multi-tenancy:

- **Flow Definitions** are stored in Cassandra and specify queue names via interface definitions
- **Queue names use templates** with `{id}` variables that are replaced with flow instance IDs
- **Services dynamically resolve queues** by looking up flow configurations at request time
- **Each tenant can have unique flows** with different queue names, providing isolation

Example flow interface definition:
```json
{
  "interfaces": {
    "triples-store": "persistent://tg/flow/triples-store:{id}",
    "graph-embeddings-store": "persistent://tg/flow/graph-embeddings-store:{id}"
  }
}
```

When tenant A starts flow `tenant-a-prod` and tenant B starts flow `tenant-b-prod`, they automatically get isolated queues:
- `persistent://tg/flow/triples-store:tenant-a-prod`
- `persistent://tg/flow/triples-store:tenant-b-prod`

**Services correctly designed for multi-tenancy:**
- ✅ **Knowledge Management (cores)** - Dynamically resolves queues from flow configuration passed in requests

**Services needing fixes:**
- 🔴 **Config Service** - Parameter name mismatch prevents queue customization
- 🔴 **Librarian Service** - Hardcoded storage management topics (discussed below)
- 🔴 **All Services** - Cannot customize Cassandra keyspace

## Problem Statement

### Issue #1: Parameter Name Mismatch in AsyncProcessor

- **CLI defines:** `--config-queue` (unclear naming)
- **Argparse converts to:** `config_queue` (in params dict)
- **Code looks for:** `config_push_queue`
- **Result:** Parameter is ignored, defaults to `persistent://tg/config/config`
- **Impact:** Affects all 32+ services inheriting from AsyncProcessor
- **Blocks:** Multi-tenant deployments cannot use tenant-specific config queues
- **Solution:** Rename CLI parameter to `--config-push-queue` for clarity (breaking change acceptable since feature is currently broken)

### Issue #2: Parameter Name Mismatch in Config Service

- **CLI defines:** `--push-queue` (ambiguous naming)
- **Argparse converts to:** `push_queue` (in params dict)
- **Code looks for:** `config_push_queue`
- **Result:** Parameter is ignored
- **Impact:** Config service cannot use custom push queue
- **Solution:** Rename CLI parameter to `--config-push-queue` for consistency and clarity (breaking change acceptable)

### Issue #3: Hardcoded Cassandra Keyspace

- **Current:** Keyspace hardcoded as `"config"`, `"knowledge"`, `"librarian"` in various services
- **Result:** Cannot customize keyspace for multi-tenant deployments
- **Impact:** Config, cores, and librarian services
- **Blocks:** Multiple tenants cannot use separate Cassandra keyspaces

### Issue #4: Collection Management Architecture ✅ COMPLETED

- **Previous:** Collections stored in Cassandra librarian keyspace via separate collections table
- **Previous:** Librarian used 4 hardcoded storage management topics to coordinate collection create/delete:
  - `vector_storage_management_topic`
  - `object_storage_management_topic`
  - `triples_storage_management_topic`
  - `storage_management_response_topic`
- **Problems (Resolved):**
  - Hardcoded topics could not be customized for multi-tenant deployments
  - Complex async coordination between librarian and 4+ storage services
  - Separate Cassandra table and management infrastructure
  - Non-persistent request/response queues for critical operations
- **Solution Implemented:** Migrated collections to config service storage, use config push for distribution
- **Status:** All storage backends migrated to `CollectionConfigHandler` pattern

## Solution

This spec addresses Issues #1, #2, #3, and #4.
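The mismatch behind Issues #1 and #2 can be reproduced in isolation. The sketch below is not TrustGraph code: `build_params`, `resolve_queue`, and `DEFAULT_QUEUE` are hypothetical stand-ins that only mimic how argparse derives dictionary keys from flag names and how a lookup for `config_push_queue` then misses a value stored under a different key.

```python
import argparse

# Default queue named in Issue #1; used here only for illustration.
DEFAULT_QUEUE = "persistent://tg/config/config"


def build_params(flag, argv):
    """Parse argv with a single queue flag and return the params dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument(flag, default=DEFAULT_QUEUE)
    # argparse derives the key from the flag name:
    # '--config-queue' becomes 'config_queue'.
    return vars(parser.parse_args(argv))


def resolve_queue(params):
    """Mimic the lookup described above: only 'config_push_queue' is read."""
    return params.get("config_push_queue", DEFAULT_QUEUE)


custom = "persistent://tg-dev/config/config"

# Old flag: the value lands under 'config_queue', so the lookup misses it.
old = build_params("--config-queue", ["--config-queue", custom])
broken_result = resolve_queue(old)

# Renamed flag: the key matches the lookup and the custom queue is used.
new = build_params("--config-push-queue", ["--config-push-queue", custom])
fixed_result = resolve_queue(new)

print(broken_result)
print(fixed_result)
```

With the old flag the custom value is silently dropped in favour of the default, which is why renaming the flag alone is enough and the lookup code needs no change.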
### Part 1: Fix Parameter Name Mismatches

#### Change 1: AsyncProcessor Base Class - Rename CLI Parameter

**File:** `trustgraph-base/trustgraph/base/async_processor.py`
**Line:** 260-264

**Current:**
```python
parser.add_argument(
    '--config-queue',
    default=default_config_queue,
    help=f'Config push queue {default_config_queue}',
)
```

**Fixed:**
```python
parser.add_argument(
    '--config-push-queue',
    default=default_config_queue,
    help=f'Config push queue (default: {default_config_queue})',
)
```

**Rationale:**
- Clearer, more explicit naming
- Matches the internal variable name `config_push_queue`
- Breaking change acceptable since feature is currently non-functional
- No code change needed in params.get() - it already looks for the correct name

#### Change 2: Config Service - Rename CLI Parameter

**File:** `trustgraph-flow/trustgraph/config/service/service.py`
**Line:** 276-279

**Current:**
```python
parser.add_argument(
    '--push-queue',
    default=default_config_push_queue,
    help=f'Config push queue (default: {default_config_push_queue})'
)
```

**Fixed:**
```python
parser.add_argument(
    '--config-push-queue',
    default=default_config_push_queue,
    help=f'Config push queue (default: {default_config_push_queue})'
)
```

**Rationale:**
- Clearer naming - "config-push-queue" is more explicit than just "push-queue"
- Matches the internal variable name `config_push_queue`
- Consistent with AsyncProcessor's `--config-push-queue` parameter
- Breaking change acceptable since feature is currently non-functional
- No code change needed in params.get() - it already looks for the correct name

### Part 2: Add Cassandra Keyspace Parameterization

#### Change 3: Add Keyspace Parameter to cassandra_config Module

**File:** `trustgraph-base/trustgraph/base/cassandra_config.py`

**Add CLI argument** (in `add_cassandra_args()` function):
```python
parser.add_argument(
    '--cassandra-keyspace',
    default=None,
    help='Cassandra keyspace (default: service-specific)'
)
```

**Add environment variable support** (in `resolve_cassandra_config()` function):
```python
# Precedence: CLI value, then environment variable, then the caller's default.
# `or` is used because argparse stores None when the flag is not given.
keyspace = (
    params.get("cassandra_keyspace")
    or os.environ.get("CASSANDRA_KEYSPACE")
    or default_keyspace
)
```

**Update signature and return value** of `resolve_cassandra_config()`:
- Add a `default_keyspace=None` keyword argument (supplied by each service in Changes 4-6)
- Currently returns: `(hosts, username, password)`
- Change to return: `(hosts, username, password, keyspace)`

**Rationale:**
- Consistent with existing Cassandra configuration pattern
- Available to all services via `add_cassandra_args()`
- Supports both CLI and environment variable configuration

#### Change 4: Config Service - Use Parameterized Keyspace

**File:** `trustgraph-flow/trustgraph/config/service/service.py`

**Line 30** - Remove hardcoded keyspace:
```python
# DELETE THIS LINE:
keyspace = "config"
```

**Lines 69-73** - Update cassandra config resolution:

**Current:**
```python
cassandra_host, cassandra_username, cassandra_password = \
    resolve_cassandra_config(params)
```

**Fixed:**
```python
cassandra_host, cassandra_username, cassandra_password, keyspace = \
    resolve_cassandra_config(params, default_keyspace="config")
```

**Rationale:**
- Maintains backward compatibility with "config" as default
- Allows override via `--cassandra-keyspace` or `CASSANDRA_KEYSPACE`

#### Change 5: Cores/Knowledge Service - Use Parameterized Keyspace

**File:** `trustgraph-flow/trustgraph/cores/service.py`

**Line 37** - Remove hardcoded keyspace:
```python
# DELETE THIS LINE:
keyspace = "knowledge"
```

**Update cassandra config resolution** (similar location as config service):
```python
cassandra_host, cassandra_username, cassandra_password, keyspace = \
    resolve_cassandra_config(params, default_keyspace="knowledge")
```

#### Change 6: Librarian Service - Use Parameterized Keyspace

**File:** `trustgraph-flow/trustgraph/librarian/service.py`

**Line 51** - Remove hardcoded keyspace:
```python
# DELETE THIS LINE:
keyspace = "librarian"
```

**Update cassandra config resolution** (similar location as config service):
```python
cassandra_host, cassandra_username, cassandra_password, keyspace = \
    resolve_cassandra_config(params, default_keyspace="librarian")
```

### Part 3: Migrate Collection Management to Config Service

#### Overview

Migrate collections from Cassandra librarian keyspace to config service storage. This eliminates hardcoded storage management topics and simplifies the architecture by using the existing config push mechanism for distribution.

#### Current Architecture

```
API Request → Gateway → Librarian Service
                            ↓
                    CollectionManager
                            ↓
        Cassandra Collections Table (librarian keyspace)
                            ↓
    Broadcast to 4 Storage Management Topics (hardcoded)
                            ↓
          Wait for 4+ Storage Service Responses
                            ↓
                  Response to Gateway
```

#### New Architecture

```
API Request → Gateway → Librarian Service
                            ↓
                    CollectionManager
                            ↓
          Config Service API (put/delete/getvalues)
                            ↓
Cassandra Config Table (class='collections', key='user:collection')
                            ↓
    Config Push (to all subscribers on config-push-queue)
                            ↓
  All Storage Services receive config update independently
```

#### Change 7: Collection Manager - Use Config Service API

**File:** `trustgraph-flow/trustgraph/librarian/collection_manager.py`

**Remove:**
- `LibraryTableStore` usage (Lines 33, 40-41)
- Storage management producers initialization (Lines 86-140)
- `on_storage_response` method (Lines 400-430)
- `pending_deletions` tracking (Lines 57, 90-96, and usage throughout)

**Add:**
- Config service client for API calls (request/response pattern)

**Config Client Setup:**
```python
# In __init__, add config request/response producers/consumers
from trustgraph.schema.services.config import ConfigRequest, ConfigResponse

# Producer for config requests
self.config_request_producer = Producer(
    client=pulsar_client,
    topic=config_request_queue,
    schema=ConfigRequest,
)

# Consumer for config responses (with correlation ID)
self.config_response_consumer = Consumer(
    taskgroup=taskgroup,
    client=pulsar_client,
    flow=None,
    topic=config_response_queue,
    subscriber=f"{id}-config",
    schema=ConfigResponse,
    handler=self.on_config_response,
)

# Tracking for pending config requests
self.pending_config_requests = {}  # request_id -> asyncio.Event
```

**Modify `list_collections` (Lines 145-180):**
```python
# Module-level imports used by the methods below
import asyncio
import json
import uuid

async def list_collections(self, user, tag_filter=None, limit=None):
    """List collections from config service"""
    # Send getvalues request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='getvalues',
        type='collections',
    )

    # Send request and wait for response
    response = await self.send_config_request(request)

    # Parse collections from response
    collections = []
    for key, value_json in response.values.items():
        if ":" in key:
            coll_user, collection = key.split(":", 1)
            if coll_user == user:
                metadata = json.loads(value_json)
                collections.append(CollectionMetadata(**metadata))

    # Apply tag filtering in-memory (as before)
    if tag_filter:
        collections = [c for c in collections if any(tag in c.tags for tag in tag_filter)]

    # Apply limit
    if limit:
        collections = collections[:limit]

    return collections

async def send_config_request(self, request):
    """Send config request and wait for response"""
    event = asyncio.Event()
    self.pending_config_requests[request.id] = event

    await self.config_request_producer.send(request)
    await event.wait()

    # Drop the event entry so completed requests do not accumulate
    self.pending_config_requests.pop(request.id, None)
    return self.pending_config_requests.pop(request.id + "_response")

async def on_config_response(self, message, consumer, flow):
    """Handle config response"""
    response = message.value()
    if response.id in self.pending_config_requests:
        # Store the response first, then wake the waiting request
        self.pending_config_requests[response.id + "_response"] = response
        self.pending_config_requests[response.id].set()
```

**Modify `update_collection` (Lines 182-312):**
```python
async def update_collection(self, user, collection, name, description, tags):
    """Update collection via config service"""
    # Create metadata
    metadata = CollectionMetadata(
        user=user,
        collection=collection,
        name=name,
        description=description,
        tags=tags,
    )

    # Send put request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='put',
        type='collections',
        key=f'{user}:{collection}',
        value=json.dumps(metadata.to_dict()),
    )

    response = await self.send_config_request(request)

    if response.error:
        raise RuntimeError(f"Config update failed: {response.error.message}")

    # Config service will trigger config push automatically
    # Storage services will receive update and create collections
```

**Modify `delete_collection` (Lines 314-398):**
```python
async def delete_collection(self, user, collection):
    """Delete collection via config service"""
    # Send delete request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='delete',
        type='collections',
        key=f'{user}:{collection}',
    )

    response = await self.send_config_request(request)

    if response.error:
        raise RuntimeError(f"Config delete failed: {response.error.message}")

    # Config service will trigger config push automatically
    # Storage services will receive update and delete collections
```

**Collection Metadata Format:**
- Stored in config table as: `class='collections', key='user:collection'`
- Value is JSON-serialized CollectionMetadata (without timestamp fields)
- Fields: `user`, `collection`, `name`, `description`, `tags`
- Example: `class='collections', key='alice:my-docs', value='{"user":"alice","collection":"my-docs","name":"My Documents","description":"...","tags":["work"]}'`

#### Change 8: Librarian Service - Remove Storage Management Infrastructure

**File:** `trustgraph-flow/trustgraph/librarian/service.py`

**Remove:**
- Storage management producers (Lines 173-190):
  - `vector_storage_management_producer`
  - `object_storage_management_producer`
  - `triples_storage_management_producer`
- Storage response consumer (Lines 192-201)
- `on_storage_response` handler (Lines 467-473)

**Modify:**
- CollectionManager initialization (Lines 215-224) - remove storage producer parameters

**Note:** External collection API remains unchanged:
- `list-collections`
- `update-collection`
- `delete-collection`

#### Change 9: Remove Collections Table from LibraryTableStore

**File:** `trustgraph-flow/trustgraph/tables/library.py`

**Delete:**
- Collections table CREATE statement (Lines 114-127)
- Collections prepared statements (Lines 205-240)
- All collection methods (Lines 578-717):
  - `ensure_collection_exists`
  - `list_collections`
  - `update_collection`
  - `delete_collection`
  - `get_collection`
  - `create_collection`

**Rationale:**
- Collections now stored in config table
- Breaking change acceptable - no data migration needed
- Simplifies librarian service significantly

#### Change 10: Storage Services - Config-Based Collection Management ✅ COMPLETED

**Status:** All 11 storage backends have been migrated to use `CollectionConfigHandler`.

**Affected Services (11 total):**
- Document embeddings: milvus, pinecone, qdrant
- Graph embeddings: milvus, pinecone, qdrant
- Object storage: cassandra
- Triples storage: cassandra, falkordb, memgraph, neo4j

**Files:**
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/pinecone/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/triples/falkordb/write.py`
- `trustgraph-flow/trustgraph/storage/triples/memgraph/write.py`
- `trustgraph-flow/trustgraph/storage/triples/neo4j/write.py`

**Implementation Pattern (all services):**

1. **Register config handler in `__init__`:**
```python
# Add after AsyncProcessor initialization
self.register_config_handler(self.on_collection_config)
self.known_collections = set()  # Track (user, collection) tuples
```

2. **Implement config handler:**
```python
async def on_collection_config(self, config, version):
    """Handle collection configuration updates"""
    logger.info(f"Collection config version: {version}")

    if "collections" not in config:
        return

    # Parse collections from config
    # Key format: "user:collection" in config["collections"]
    config_collections = set()
    for key in config["collections"].keys():
        if ":" in key:
            user, collection = key.split(":", 1)
            config_collections.add((user, collection))

    # Determine changes
    to_create = config_collections - self.known_collections
    to_delete = self.known_collections - config_collections

    # Create new collections (idempotent)
    for user, collection in to_create:
        try:
            await self.create_collection_internal(user, collection)
            self.known_collections.add((user, collection))
            logger.info(f"Created collection: {user}/{collection}")
        except Exception as e:
            logger.error(f"Failed to create {user}/{collection}: {e}")

    # Delete removed collections (idempotent)
    for user, collection in to_delete:
        try:
            await self.delete_collection_internal(user, collection)
            self.known_collections.discard((user, collection))
            logger.info(f"Deleted collection: {user}/{collection}")
        except Exception as e:
            logger.error(f"Failed to delete {user}/{collection}: {e}")
```

3. **Initialize known collections on startup:**
```python
async def start(self):
    """Start the processor"""
    await super().start()
    await self.sync_known_collections()

async def sync_known_collections(self):
    """Query backend to populate known_collections set"""
    # Backend-specific implementation:
    # - Milvus/Pinecone/Qdrant: List collections/indexes matching naming pattern
    # - Cassandra: Query keyspaces or collection metadata
    # - Neo4j/Memgraph/FalkorDB: Query CollectionMetadata nodes
    pass
```

4. **Refactor existing handler methods:**
```python
# Rename and remove response sending:
# handle_create_collection → create_collection_internal
# handle_delete_collection → delete_collection_internal

async def create_collection_internal(self, user, collection):
    """Create collection (idempotent)"""
    # Same logic as current handle_create_collection
    # But remove response producer calls
    # Handle "already exists" gracefully
    pass

async def delete_collection_internal(self, user, collection):
    """Delete collection (idempotent)"""
    # Same logic as current handle_delete_collection
    # But remove response producer calls
    # Handle "not found" gracefully
    pass
```

5. **Remove storage management infrastructure:**
   - Remove `self.storage_request_consumer` setup and start
   - Remove `self.storage_response_producer` setup
   - Remove `on_storage_management` dispatcher method
   - Remove metrics for storage management
   - Remove imports: `StorageManagementRequest`, `StorageManagementResponse`

**Backend-Specific Considerations:**

- **Vector stores (Milvus, Pinecone, Qdrant):** Track logical `(user, collection)` in `known_collections`, but may create multiple backend collections per dimension. Continue lazy creation pattern. Delete operations must remove all dimension variants.
- **Cassandra Objects:** Collections are row properties, not structures. Track keyspace-level information.
- **Graph stores (Neo4j, Memgraph, FalkorDB):** Query `CollectionMetadata` nodes on startup. Create/delete metadata nodes on sync.
- **Cassandra Triples:** Use `KnowledgeGraph` API for collection operations.
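
The implementation pattern above reduces to a set difference between the collections named in the pushed config and the collections the service believes it already has. The following is a minimal, self-contained simulation of that reconciliation, assuming an in-memory backend. `CollectionReconciler`, its `backend` set, and the `fail_once` argument are hypothetical names introduced only for this sketch; the real services use their own storage clients.

```python
import asyncio


class CollectionReconciler:
    """Hypothetical stand-in for a storage service's collection handling."""

    def __init__(self, fail_once=()):
        self.known_collections = set()    # (user, collection) tuples
        self.backend = set()              # simulated backend state
        self._fail_once = set(fail_once)  # creations that fail on first try

    async def create_collection_internal(self, user, collection):
        if (user, collection) in self._fail_once:
            self._fail_once.discard((user, collection))
            raise RuntimeError("simulated backend outage")
        self.backend.add((user, collection))  # adding twice is harmless

    async def delete_collection_internal(self, user, collection):
        self.backend.discard((user, collection))  # no error if absent

    async def on_collection_config(self, config, version):
        if "collections" not in config:
            return
        wanted = {
            tuple(key.split(":", 1))
            for key in config["collections"]
            if ":" in key
        }
        # Create what the config names but the service does not know yet.
        for item in wanted - self.known_collections:
            try:
                await self.create_collection_internal(*item)
                self.known_collections.add(item)
            except Exception as e:
                # Not added to known_collections, so the next push retries it.
                print(f"v{version}: create failed for {item}: {e}")
        # Delete what the service knows but the config no longer names.
        for item in self.known_collections - wanted:
            try:
                await self.delete_collection_internal(*item)
                self.known_collections.discard(item)
            except Exception as e:
                print(f"v{version}: delete failed for {item}: {e}")


async def demo():
    r = CollectionReconciler(fail_once=[("alice", "docs")])
    config = {"collections": {"alice:docs": "{}", "bob:notes": "{}"}}

    await r.on_collection_config(config, version=1)
    after_first = set(r.known_collections)

    # The same config arrives again; only the failed creation is retried.
    await r.on_collection_config(config, version=2)
    after_retry = set(r.known_collections)

    # A later push drops alice:docs, which triggers deletion.
    await r.on_collection_config({"collections": {"bob:notes": "{}"}}, version=3)
    after_delete = set(r.known_collections)
    return after_first, after_retry, after_delete


after_first, after_retry, after_delete = asyncio.run(demo())
```

Because a failed creation is never recorded in `known_collections`, it stays in the difference and is attempted again on the next push, without any response being sent back to the librarian.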

**Key Design Points:**

- **Eventual consistency:** No request/response mechanism, config push is broadcast
- **Idempotency:** All create/delete operations must be safe to retry
- **Error handling:** Log errors but don't block config updates
- **Self-healing:** Failed operations will retry on next config push
- **Collection key format:** `"user:collection"` in `config["collections"]`

#### Change 11: Update Collection Schema - Remove Timestamps

**File:** `trustgraph-base/trustgraph/schema/services/collection.py`

**Modify CollectionMetadata (Lines 13-21):**

Remove `created_at` and `updated_at` fields:
```python
class CollectionMetadata(Record):
    user = String()
    collection = String()
    name = String()
    description = String()
    tags = Array(String())
    # Remove: created_at = String()
    # Remove: updated_at = String()
```

**Modify CollectionManagementRequest (Lines 25-47):**

Remove timestamp fields:
```python
class CollectionManagementRequest(Record):
    operation = String()
    user = String()
    collection = String()
    timestamp = String()
    name = String()
    description = String()
    tags = Array(String())
    # Remove: created_at = String()
    # Remove: updated_at = String()
    tag_filter = Array(String())
    limit = Integer()
```

**Rationale:**
- Timestamps don't add value for collections
- Config service maintains its own version tracking
- Simplifies schema and reduces storage

#### Benefits of Config Service Migration

1. ✅ **Eliminates hardcoded storage management topics** - Solves multi-tenant blocker
2. ✅ **Simpler coordination** - No complex async waiting for 4+ storage responses
3. ✅ **Eventual consistency** - Storage services update independently via config push
4. ✅ **Better reliability** - Persistent config push vs non-persistent request/response
5. ✅ **Unified configuration model** - Collections treated as configuration
6. ✅ **Reduces complexity** - Removes ~300 lines of coordination code
7. ✅ **Multi-tenant ready** - Config already supports tenant isolation via keyspace
8. ✅ **Version tracking** - Config service version mechanism provides audit trail

## Implementation Notes

### Backward Compatibility

**Parameter Changes:**
- CLI parameter renames are breaking changes but acceptable (feature currently non-functional)
- Services work without parameters (use defaults)
- Default keyspaces preserved: "config", "knowledge", "librarian"
- Default queue: `persistent://tg/config/config`

**Collection Management:**
- **Breaking change:** Collections table removed from librarian keyspace
- **No data migration provided** - acceptable for this phase
- External collection API unchanged (list/update/delete operations)
- Collection metadata format simplified (timestamps removed)

### Testing Requirements

**Parameter Testing:**
1. Verify `--config-push-queue` parameter works on graph-embeddings service
2. Verify `--config-push-queue` parameter works on text-completion service
3. Verify `--config-push-queue` parameter works on config service
4. Verify `--cassandra-keyspace` parameter works for config service
5. Verify `--cassandra-keyspace` parameter works for cores service
6. Verify `--cassandra-keyspace` parameter works for librarian service
7. Verify services work without parameters (uses defaults)
8. Verify multi-tenant deployment with custom queue names and keyspace

**Collection Management Testing:**
9. Verify `list-collections` operation via config service
10. Verify `update-collection` creates/updates in config table
11. Verify `delete-collection` removes from config table
12. Verify config push is triggered on collection updates
13. Verify tag filtering works with config-based storage
14. Verify collection operations work without timestamp fields

### Multi-Tenant Deployment Example

```bash
# Tenant: tg-dev
graph-embeddings \
    -p pulsar+ssl://broker:6651 \
    --pulsar-api-key \
    --config-push-queue persistent://tg-dev/config/config

config-service \
    -p pulsar+ssl://broker:6651 \
    --pulsar-api-key \
    --config-push-queue persistent://tg-dev/config/config \
    --cassandra-keyspace tg_dev_config
```

## Impact Analysis

### Services Affected by Changes 1-2 (CLI Parameter Rename)

All services inheriting from AsyncProcessor or FlowProcessor:
- config-service
- cores-service
- librarian-service
- graph-embeddings
- document-embeddings
- text-completion-* (all providers)
- extract-* (all extractors)
- query-* (all query services)
- retrieval-* (all RAG services)
- storage-* (all storage services)
- And 20+ more services

### Services Affected by Changes 3-6 (Cassandra Keyspace)

- config-service
- cores-service
- librarian-service

### Services Affected by Changes 7-11 (Collection Management)

**Immediate Changes:**
- librarian-service (collection_manager.py, service.py)
- tables/library.py (collections table removal)
- schema/services/collection.py (timestamp removal)

**Completed Changes (Change 10):** ✅
- All storage services (11 total) - migrated to config push for collection updates via `CollectionConfigHandler`
- Storage management schema removed from `storage.py`

## Future Considerations

### Per-User Keyspace Model

Some services use **per-user keyspaces** dynamically, where each user gets their own Cassandra keyspace:

**Services with per-user keyspaces:**
1. **Triples Query Service** (`trustgraph-flow/trustgraph/query/triples/cassandra/service.py:65`)
   - Uses `keyspace=query.user`
2. **Objects Query Service** (`trustgraph-flow/trustgraph/query/objects/cassandra/service.py:479`)
   - Uses `keyspace=self.sanitize_name(user)`
3. **KnowledgeGraph Direct Access** (`trustgraph-flow/trustgraph/direct/cassandra_kg.py:18`)
   - Default parameter `keyspace="trustgraph"`

**Status:** These are **not modified** in this specification.

**Future Review Required:**
- Evaluate whether per-user keyspace model creates tenant isolation issues
- Consider if multi-tenant deployments need keyspace prefix patterns (e.g., `tenant_a_user1`)
- Review for potential user ID collision across tenants
- Assess if single shared keyspace per tenant with user-based row isolation is preferable

**Note:** This does not block the current multi-tenant implementation but should be reviewed before production multi-tenant deployments.

## Implementation Phases

### Phase 1: Parameter Fixes (Changes 1-6)

- Fix `--config-push-queue` parameter naming
- Add `--cassandra-keyspace` parameter support
- **Outcome:** Multi-tenant queue and keyspace configuration enabled

### Phase 2: Collection Management Migration (Changes 7-9, 11)

- Migrate collection storage to config service
- Remove collections table from librarian
- Update collection schema (remove timestamps)
- **Outcome:** Eliminates hardcoded storage management topics, simplifies librarian

### Phase 3: Storage Service Updates (Change 10) ✅ COMPLETED

- Updated all storage services to use config push for collections via `CollectionConfigHandler`
- Removed storage management request/response infrastructure
- Removed legacy schema definitions
- **Outcome:** Complete config-based collection management achieved

## References

- GitHub Issue: https://github.com/trustgraph-ai/trustgraph/issues/582
- Related Files:
  - `trustgraph-base/trustgraph/base/async_processor.py`
  - `trustgraph-base/trustgraph/base/cassandra_config.py`
  - `trustgraph-base/trustgraph/schema/core/topic.py`
  - `trustgraph-base/trustgraph/schema/services/collection.py`
  - `trustgraph-flow/trustgraph/config/service/service.py`
  - `trustgraph-flow/trustgraph/cores/service.py`
  - `trustgraph-flow/trustgraph/librarian/service.py`
  - `trustgraph-flow/trustgraph/librarian/collection_manager.py`
  - `trustgraph-flow/trustgraph/tables/library.py`