| layout | title | parent |
|---|---|---|
| default | Technical Specification: Multi-Tenant Support | Tech Specs |
Technical Specification: Multi-Tenant Support
Overview
Enable multi-tenant deployments by fixing parameter name mismatches that prevent queue customization and adding Cassandra keyspace parameterization.
Architecture Context
Flow-Based Queue Resolution
The TrustGraph system uses a flow-based architecture for dynamic queue resolution, which inherently supports multi-tenancy:
- Flow Definitions are stored in Cassandra and specify queue names via interface definitions
- Queue names use templates with {id} variables that are replaced with flow instance IDs
- Services dynamically resolve queues by looking up flow configurations at request time
- Each tenant can have unique flows with different queue names, providing isolation
Example flow interface definition:
{
"interfaces": {
"triples-store": "persistent://tg/flow/triples-store:{id}",
"graph-embeddings-store": "persistent://tg/flow/graph-embeddings-store:{id}"
}
}
When tenant A starts flow tenant-a-prod and tenant B starts flow tenant-b-prod, they automatically get isolated queues:
- persistent://tg/flow/triples-store:tenant-a-prod
- persistent://tg/flow/triples-store:tenant-b-prod
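The template substitution can be sketched in a few lines. This is only an illustration of the idea, not TrustGraph's actual resolver; the helper name resolve_queue is hypothetical.

```python
# Hypothetical helper: expands the {id} placeholder in a flow interface template.
def resolve_queue(template: str, flow_id: str) -> str:
    return template.replace("{id}", flow_id)


# Interface definition taken from the example above
interfaces = {
    "triples-store": "persistent://tg/flow/triples-store:{id}",
}

# Two tenants start flows with different IDs and get distinct queues
queue_a = resolve_queue(interfaces["triples-store"], "tenant-a-prod")
queue_b = resolve_queue(interfaces["triples-store"], "tenant-b-prod")

print(queue_a)
print(queue_b)
```

Because the flow ID is part of the topic name, isolation comes from choosing distinct flow IDs per tenant rather than from any per-service setting.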
Services correctly designed for multi-tenancy:
- ✅ Knowledge Management (cores) - Dynamically resolves queues from flow configuration passed in requests
Services needing fixes:
- 🔴 Config Service - Parameter name mismatch prevents queue customization
- 🔴 Librarian Service - Hardcoded storage management topics (discussed below)
- 🔴 All Services - Cannot customize Cassandra keyspace
Problem Statement
Issue #1: Parameter Name Mismatch in AsyncProcessor
- CLI defines: --config-queue (unclear naming)
- Argparse converts to: config_queue (in params dict)
- Code looks for: config_push_queue
- Result: Parameter is ignored, defaults to persistent://tg/config/config
- Impact: Affects all 32+ services inheriting from AsyncProcessor
- Blocks: Multi-tenant deployments cannot use tenant-specific config queues
- Solution: Rename CLI parameter to --config-push-queue for clarity (breaking change acceptable since feature is currently broken)
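The mismatch is easy to reproduce with argparse alone. The sketch below assumes the params dict is built with vars(args), which matches the description above but is not confirmed against the TrustGraph source; the constant name DEFAULT_QUEUE is illustrative.

```python
import argparse

DEFAULT_QUEUE = "persistent://tg/config/config"

parser = argparse.ArgumentParser()
parser.add_argument("--config-queue", default=DEFAULT_QUEUE)

# Simulate a tenant passing a custom queue on the command line
args = parser.parse_args(["--config-queue", "persistent://tg-dev/config/config"])
params = vars(args)

# argparse derives the key from the flag name: "config_queue"
print(sorted(params))

# The lookup uses a different key, so the custom value is silently ignored
effective = params.get("config_push_queue", DEFAULT_QUEUE)
print(effective)
```

Renaming the flag to --config-push-queue makes argparse produce the key config_push_queue, so the existing lookup starts working without further code changes.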
Issue #2: Parameter Name Mismatch in Config Service
- CLI defines: --push-queue (ambiguous naming)
- Argparse converts to: push_queue (in params dict)
- Code looks for: config_push_queue
- Result: Parameter is ignored
- Impact: Config service cannot use custom push queue
- Solution: Rename CLI parameter to --config-push-queue for consistency and clarity (breaking change acceptable)
Issue #3: Hardcoded Cassandra Keyspace
- Current: Keyspace hardcoded as "config", "knowledge", "librarian" in various services
- Result: Cannot customize keyspace for multi-tenant deployments
- Impact: Config, cores, and librarian services
- Blocks: Multiple tenants cannot use separate Cassandra keyspaces
Issue #4: Collection Management Architecture ✅ COMPLETED
- Previous: Collections stored in Cassandra librarian keyspace via separate collections table
- Previous: Librarian used 4 hardcoded storage management topics to coordinate collection create/delete:
  - vector_storage_management_topic
  - object_storage_management_topic
  - triples_storage_management_topic
  - storage_management_response_topic
- Problems (Resolved):
  - Hardcoded topics could not be customized for multi-tenant deployments
  - Complex async coordination between librarian and 4+ storage services
  - Separate Cassandra table and management infrastructure
  - Non-persistent request/response queues for critical operations
- Solution Implemented: Migrated collections to config service storage, use config push for distribution
- Status: All storage backends migrated to CollectionConfigHandler pattern
Solution
This spec addresses Issues #1, #2, #3, and #4.
Part 1: Fix Parameter Name Mismatches
Change 1: AsyncProcessor Base Class - Rename CLI Parameter
File: trustgraph-base/trustgraph/base/async_processor.py
Lines: 260-264
Current:
parser.add_argument(
'--config-queue',
default=default_config_queue,
help=f'Config push queue {default_config_queue}',
)
Fixed:
parser.add_argument(
'--config-push-queue',
default=default_config_queue,
help=f'Config push queue (default: {default_config_queue})',
)
Rationale:
- Clearer, more explicit naming
- Matches the internal variable name config_push_queue
- Breaking change acceptable since feature is currently non-functional
- No code change needed in params.get() - it already looks for the correct name
Change 2: Config Service - Rename CLI Parameter
File: trustgraph-flow/trustgraph/config/service/service.py
Lines: 276-279
Current:
parser.add_argument(
'--push-queue',
default=default_config_push_queue,
help=f'Config push queue (default: {default_config_push_queue})'
)
Fixed:
parser.add_argument(
'--config-push-queue',
default=default_config_push_queue,
help=f'Config push queue (default: {default_config_push_queue})'
)
Rationale:
- Clearer naming - "config-push-queue" is more explicit than just "push-queue"
- Matches the internal variable name config_push_queue
- Consistent with AsyncProcessor's --config-push-queue parameter
- Breaking change acceptable since feature is currently non-functional
- No code change needed in params.get() - it already looks for the correct name
Part 2: Add Cassandra Keyspace Parameterization
Change 3: Add Keyspace Parameter to cassandra_config Module
File: trustgraph-base/trustgraph/base/cassandra_config.py
Add CLI argument (in add_cassandra_args() function):
parser.add_argument(
'--cassandra-keyspace',
default=None,
help='Cassandra keyspace (default: service-specific)'
)
Add environment variable support (in resolve_cassandra_config() function):
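# argparse stores None under "cassandra_keyspace" when the flag is omitted,
# so chain with "or" rather than relying on the params.get() default
keyspace = (
    params.get("cassandra_keyspace")
    or os.environ.get("CASSANDRA_KEYSPACE")
    or default_keyspace  # per-service fallback, as passed in Changes 4-6
)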
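Update signature and return value of resolve_cassandra_config():
- Currently returns: (hosts, username, password)
- Change to return: (hosts, username, password, keyspace)
- Accept a default_keyspace keyword argument, used when neither the CLI flag nor the environment variable is set (Changes 4-6 pass it)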
Rationale:
- Consistent with existing Cassandra configuration pattern
- Available to all services via add_cassandra_args()
- Supports both CLI and environment variable configuration
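The intended precedence (CLI flag, then CASSANDRA_KEYSPACE, then the service default) can be checked in isolation. The helper resolve_keyspace below is hypothetical and stands in for the keyspace part of resolve_cassandra_config(); the environ parameter exists only to make the sketch testable.

```python
import os


def resolve_keyspace(params, default_keyspace, environ=os.environ):
    """Hypothetical helper: CLI value, then environment, then service default."""
    return (
        params.get("cassandra_keyspace")
        or environ.get("CASSANDRA_KEYSPACE")
        or default_keyspace
    )


# CLI flag given: it wins even if the environment variable is set
from_cli = resolve_keyspace(
    {"cassandra_keyspace": "tg_dev_config"},
    "config",
    environ={"CASSANDRA_KEYSPACE": "ignored"},
)

# Flag omitted (argparse default None): environment variable is used
from_env = resolve_keyspace(
    {"cassandra_keyspace": None},
    "config",
    environ={"CASSANDRA_KEYSPACE": "tenant_a_config"},
)

# Neither set: the per-service default is preserved
from_default = resolve_keyspace({"cassandra_keyspace": None}, "config", environ={})

print(from_cli, from_env, from_default)
```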
Change 4: Config Service - Use Parameterized Keyspace
File: trustgraph-flow/trustgraph/config/service/service.py
Line 30 - Remove hardcoded keyspace:
# DELETE THIS LINE:
keyspace = "config"
Lines 69-73 - Update cassandra config resolution:
Current:
cassandra_host, cassandra_username, cassandra_password = \
resolve_cassandra_config(params)
Fixed:
cassandra_host, cassandra_username, cassandra_password, keyspace = \
resolve_cassandra_config(params, default_keyspace="config")
Rationale:
- Maintains backward compatibility with "config" as default
- Allows override via --cassandra-keyspace or CASSANDRA_KEYSPACE
Change 5: Cores/Knowledge Service - Use Parameterized Keyspace
File: trustgraph-flow/trustgraph/cores/service.py
Line 37 - Remove hardcoded keyspace:
# DELETE THIS LINE:
keyspace = "knowledge"
Update cassandra config resolution (similar location as config service):
cassandra_host, cassandra_username, cassandra_password, keyspace = \
resolve_cassandra_config(params, default_keyspace="knowledge")
Change 6: Librarian Service - Use Parameterized Keyspace
File: trustgraph-flow/trustgraph/librarian/service.py
Line 51 - Remove hardcoded keyspace:
# DELETE THIS LINE:
keyspace = "librarian"
Update cassandra config resolution (similar location as config service):
cassandra_host, cassandra_username, cassandra_password, keyspace = \
resolve_cassandra_config(params, default_keyspace="librarian")
Part 3: Migrate Collection Management to Config Service
Overview
Migrate collections from Cassandra librarian keyspace to config service storage. This eliminates hardcoded storage management topics and simplifies the architecture by using the existing config push mechanism for distribution.
Current Architecture
API Request → Gateway → Librarian Service
↓
CollectionManager
↓
Cassandra Collections Table (librarian keyspace)
↓
Broadcast to 4 Storage Management Topics (hardcoded)
↓
Wait for 4+ Storage Service Responses
↓
Response to Gateway
New Architecture
API Request → Gateway → Librarian Service
↓
CollectionManager
↓
Config Service API (put/delete/getvalues)
↓
Cassandra Config Table (class='collections', key='user:collection')
↓
Config Push (to all subscribers on config-push-queue)
↓
All Storage Services receive config update independently
Change 7: Collection Manager - Use Config Service API
File: trustgraph-flow/trustgraph/librarian/collection_manager.py
Remove:
- LibraryTableStore usage (Lines 33, 40-41)
- Storage management producers initialization (Lines 86-140)
- on_storage_response method (Lines 400-430)
- pending_deletions tracking (Lines 57, 90-96, and usage throughout)
Add:
- Config service client for API calls (request/response pattern)
Config Client Setup:
# In __init__, add config request/response producers/consumers
from trustgraph.schema.services.config import ConfigRequest, ConfigResponse
# Producer for config requests
self.config_request_producer = Producer(
client=pulsar_client,
topic=config_request_queue,
schema=ConfigRequest,
)
# Consumer for config responses (with correlation ID)
self.config_response_consumer = Consumer(
taskgroup=taskgroup,
client=pulsar_client,
flow=None,
topic=config_response_queue,
subscriber=f"{id}-config",
schema=ConfigResponse,
handler=self.on_config_response,
)
# Tracking for pending config requests
self.pending_config_requests = {} # request_id -> asyncio.Event
Modify list_collections (Lines 145-180):
async def list_collections(self, user, tag_filter=None, limit=None):
    """List collections from config service"""
    # Send getvalues request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='getvalues',
        type='collections',
    )

    # Send request and wait for response
    response = await self.send_config_request(request)
    if response.error:
        raise RuntimeError(f"Config list failed: {response.error.message}")

    # Parse collections from response, keeping only this user's entries
    collections = []
    for key, value_json in response.values.items():
        if ":" in key:
            coll_user, collection = key.split(":", 1)
            if coll_user == user:
                metadata = json.loads(value_json)
                collections.append(CollectionMetadata(**metadata))

    # Apply tag filtering in-memory (as before)
    if tag_filter:
        collections = [c for c in collections if any(tag in c.tags for tag in tag_filter)]

    # Apply limit
    if limit:
        collections = collections[:limit]

    return collections
async def send_config_request(self, request):
    """Send config request and wait for response"""
    event = asyncio.Event()
    self.pending_config_requests[request.id] = event
    try:
        await self.config_request_producer.send(request)
        await event.wait()
        return self.pending_config_requests.pop(request.id + "_response")
    finally:
        # Drop the event entry so completed requests do not accumulate
        self.pending_config_requests.pop(request.id, None)

async def on_config_response(self, message, consumer, flow):
    """Handle config response"""
    response = message.value()
    # Ignore responses that do not match an in-flight request
    if response.id in self.pending_config_requests:
        # Store the response first, then wake the waiter
        self.pending_config_requests[response.id + "_response"] = response
        self.pending_config_requests[response.id].set()
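The same correlation-ID pattern can also be written with one asyncio future per request plus a timeout, so that a lost response does not block the caller forever. The sketch below is an alternative under stated assumptions, not the spec's implementation: Message and Correlator are hypothetical stand-ins for ConfigRequest/ConfigResponse and the collection manager, and the transport is an in-memory fake rather than Pulsar.

```python
import asyncio
import uuid
from dataclasses import dataclass


@dataclass
class Message:
    """Stand-in for ConfigRequest / ConfigResponse."""
    id: str
    payload: str


class Correlator:
    def __init__(self, send):
        self._send = send      # async callable that transmits a request
        self._pending = {}     # request id -> asyncio.Future

    async def request(self, payload, timeout=5.0):
        msg = Message(id=str(uuid.uuid4()), payload=payload)
        future = asyncio.get_running_loop().create_future()
        self._pending[msg.id] = future
        try:
            await self._send(msg)
            # Raises a timeout error if no response arrives in time
            return await asyncio.wait_for(future, timeout)
        finally:
            # Always clean up, including on timeout or send failure
            self._pending.pop(msg.id, None)

    def on_response(self, msg):
        future = self._pending.get(msg.id)
        if future is not None and not future.done():
            future.set_result(msg)


async def demo():
    async def fake_send(msg):
        # Fake service: echo the payload upper-cased on the next loop iteration
        reply = Message(msg.id, msg.payload.upper())
        asyncio.get_running_loop().call_soon(correlator.on_response, reply)

    correlator = Correlator(fake_send)
    reply = await correlator.request("getvalues")
    return reply.payload, len(correlator._pending)


result = asyncio.run(demo())
print(result)
```

A future carries the response itself, which avoids storing a second "_response" entry in the tracking dict; the timeout value is an arbitrary choice for the sketch.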
Modify update_collection (Lines 182-312):
async def update_collection(self, user, collection, name, description, tags):
    """Update collection via config service"""
    # Create metadata
    metadata = CollectionMetadata(
        user=user,
        collection=collection,
        name=name,
        description=description,
        tags=tags,
    )

    # Send put request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='put',
        type='collections',
        key=f'{user}:{collection}',
        value=json.dumps(metadata.to_dict()),
    )

    response = await self.send_config_request(request)
    if response.error:
        raise RuntimeError(f"Config update failed: {response.error.message}")

    # Config service will trigger config push automatically
    # Storage services will receive update and create collections
Modify delete_collection (Lines 314-398):
async def delete_collection(self, user, collection):
    """Delete collection via config service"""
    # Send delete request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='delete',
        type='collections',
        key=f'{user}:{collection}',
    )

    response = await self.send_config_request(request)
    if response.error:
        raise RuntimeError(f"Config delete failed: {response.error.message}")

    # Config service will trigger config push automatically
    # Storage services will receive update and delete collections
Collection Metadata Format:
- Stored in config table as: class='collections', key='user:collection'
- Value is JSON-serialized CollectionMetadata (without timestamp fields)
- Fields: user, collection, name, description, tags
- Example: class='collections', key='alice:my-docs', value='{"user":"alice","collection":"my-docs","name":"My Documents","description":"...","tags":["work"]}'
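A short sketch of building and parsing such an entry follows. The helper names make_key and parse_key are hypothetical. Since the code in this spec parses keys with split(":", 1), a user ID containing a colon would be misread; the guard in make_key is an assumption added for illustration, not a rule the spec states.

```python
import json


def make_key(user, collection):
    """Hypothetical helper: build the 'user:collection' config key."""
    if ":" in user:
        # Assumption: reject user IDs that would make the key ambiguous
        raise ValueError("user must not contain ':'")
    return f"{user}:{collection}"


def parse_key(key):
    """Split on the first colon only, so collection names may contain ':'."""
    user, collection = key.split(":", 1)
    return user, collection


metadata = {
    "user": "alice",
    "collection": "my-docs",
    "name": "My Documents",
    "description": "...",
    "tags": ["work"],
}

key = make_key(metadata["user"], metadata["collection"])
value = json.dumps(metadata)

print(key)
print(parse_key(key))
```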
Change 8: Librarian Service - Remove Storage Management Infrastructure
File: trustgraph-flow/trustgraph/librarian/service.py
Remove:
- Storage management producers (Lines 173-190):
  - vector_storage_management_producer
  - object_storage_management_producer
  - triples_storage_management_producer
- Storage response consumer (Lines 192-201)
- on_storage_response handler (Lines 467-473)
Modify:
- CollectionManager initialization (Lines 215-224) - remove storage producer parameters
Note: External collection API remains unchanged:
- list-collections
- update-collection
- delete-collection
Change 9: Remove Collections Table from LibraryTableStore
File: trustgraph-flow/trustgraph/tables/library.py
Delete:
- Collections table CREATE statement (Lines 114-127)
- Collections prepared statements (Lines 205-240)
- All collection methods (Lines 578-717):
  - ensure_collection_exists
  - list_collections
  - update_collection
  - delete_collection
  - get_collection
  - create_collection
Rationale:
- Collections now stored in config table
- Breaking change acceptable - no data migration needed
- Simplifies librarian service significantly
Change 10: Storage Services - Config-Based Collection Management ✅ COMPLETED
Status: All 11 storage backends have been migrated to use CollectionConfigHandler.
Affected Services (11 total):
- Document embeddings: milvus, pinecone, qdrant
- Graph embeddings: milvus, pinecone, qdrant
- Object storage: cassandra
- Triples storage: cassandra, falkordb, memgraph, neo4j
Files:
- trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py
- trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py
- trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py
- trustgraph-flow/trustgraph/storage/graph_embeddings/milvus/write.py
- trustgraph-flow/trustgraph/storage/graph_embeddings/pinecone/write.py
- trustgraph-flow/trustgraph/storage/graph_embeddings/qdrant/write.py
- trustgraph-flow/trustgraph/storage/objects/cassandra/write.py
- trustgraph-flow/trustgraph/storage/triples/cassandra/write.py
- trustgraph-flow/trustgraph/storage/triples/falkordb/write.py
- trustgraph-flow/trustgraph/storage/triples/memgraph/write.py
- trustgraph-flow/trustgraph/storage/triples/neo4j/write.py
Implementation Pattern (all services):
- Register config handler in __init__:
# Add after AsyncProcessor initialization
self.register_config_handler(self.on_collection_config)
self.known_collections = set() # Track (user, collection) tuples
- Implement config handler:
async def on_collection_config(self, config, version):
    """Handle collection configuration updates"""
    logger.info(f"Collection config version: {version}")

    if "collections" not in config:
        return

    # Parse collections from config
    # Key format: "user:collection" in config["collections"]
    config_collections = set()
    for key in config["collections"].keys():
        if ":" in key:
            user, collection = key.split(":", 1)
            config_collections.add((user, collection))

    # Determine changes
    to_create = config_collections - self.known_collections
    to_delete = self.known_collections - config_collections

    # Create new collections (idempotent)
    for user, collection in to_create:
        try:
            await self.create_collection_internal(user, collection)
            self.known_collections.add((user, collection))
            logger.info(f"Created collection: {user}/{collection}")
        except Exception as e:
            logger.error(f"Failed to create {user}/{collection}: {e}")

    # Delete removed collections (idempotent)
    for user, collection in to_delete:
        try:
            await self.delete_collection_internal(user, collection)
            self.known_collections.discard((user, collection))
            logger.info(f"Deleted collection: {user}/{collection}")
        except Exception as e:
            logger.error(f"Failed to delete {user}/{collection}: {e}")
- Initialize known collections on startup:
async def start(self):
    """Start the processor"""
    await super().start()
    await self.sync_known_collections()

async def sync_known_collections(self):
    """Query backend to populate known_collections set"""
    # Backend-specific implementation:
    # - Milvus/Pinecone/Qdrant: List collections/indexes matching naming pattern
    # - Cassandra: Query keyspaces or collection metadata
    # - Neo4j/Memgraph/FalkorDB: Query CollectionMetadata nodes
    pass
- Refactor existing handler methods:
# Rename and remove response sending:
# handle_create_collection → create_collection_internal
# handle_delete_collection → delete_collection_internal

async def create_collection_internal(self, user, collection):
    """Create collection (idempotent)"""
    # Same logic as current handle_create_collection
    # But remove response producer calls
    # Handle "already exists" gracefully
    pass

async def delete_collection_internal(self, user, collection):
    """Delete collection (idempotent)"""
    # Same logic as current handle_delete_collection
    # But remove response producer calls
    # Handle "not found" gracefully
    pass
- Remove storage management infrastructure:
  - Remove self.storage_request_consumer setup and start
  - Remove self.storage_response_producer setup
  - Remove on_storage_management dispatcher method
  - Remove metrics for storage management
  - Remove imports: StorageManagementRequest, StorageManagementResponse
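The reconciliation logic in the handler pattern can be exercised without any real backend. The sketch below uses hypothetical Reconciler and FlakyBackend classes (neither is part of TrustGraph) to show that a failed create is left out of the known set and is therefore retried on the next config push.

```python
import asyncio


class FlakyBackend:
    """Stub backend whose first create call fails."""

    def __init__(self):
        self.store = set()
        self.failures_left = 1

    async def create(self, user, collection):
        if self.failures_left:
            self.failures_left -= 1
            raise RuntimeError("backend unavailable")
        self.store.add((user, collection))   # set.add is idempotent

    async def delete(self, user, collection):
        self.store.discard((user, collection))   # no error if absent


class Reconciler:
    def __init__(self, backend):
        self.backend = backend
        self.known = set()   # (user, collection) tuples

    async def on_config(self, config):
        if "collections" not in config:
            return
        wanted = {
            tuple(key.split(":", 1))
            for key in config["collections"]
            if ":" in key
        }
        for item in wanted - self.known:
            try:
                await self.backend.create(*item)
                self.known.add(item)
            except Exception:
                pass   # not added to known, so the next push retries it
        for item in self.known - wanted:
            try:
                await self.backend.delete(*item)
                self.known.discard(item)
            except Exception:
                pass   # still known, so the next push retries the delete


async def demo():
    reconciler = Reconciler(FlakyBackend())
    config = {"collections": {"alice:my-docs": "{}"}}
    await reconciler.on_config(config)      # create fails
    after_first = set(reconciler.known)
    await reconciler.on_config(config)      # same push again: create succeeds
    after_second = set(reconciler.known)
    await reconciler.on_config({"collections": {}})   # collection removed
    return after_first, after_second, set(reconciler.known)


first, second, final = asyncio.run(demo())
print(first, second, final)
```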
Backend-Specific Considerations:
- Vector stores (Milvus, Pinecone, Qdrant): Track logical (user, collection) in known_collections, but may create multiple backend collections per dimension. Continue lazy creation pattern. Delete operations must remove all dimension variants.
- Cassandra Objects: Collections are row properties, not structures. Track keyspace-level information.
- Graph stores (Neo4j, Memgraph, FalkorDB): Query CollectionMetadata nodes on startup. Create/delete metadata nodes on sync.
- Cassandra Triples: Use KnowledgeGraph API for collection operations.
Key Design Points:
- Eventual consistency: No request/response mechanism, config push is broadcast
- Idempotency: All create/delete operations must be safe to retry
- Error handling: Log errors but don't block config updates
- Self-healing: Failed operations will retry on next config push
- Collection key format: "user:collection" in config["collections"]
Change 11: Update Collection Schema - Remove Timestamps
File: trustgraph-base/trustgraph/schema/services/collection.py
Modify CollectionMetadata (Lines 13-21):
Remove created_at and updated_at fields:
class CollectionMetadata(Record):
    user = String()
    collection = String()
    name = String()
    description = String()
    tags = Array(String())
    # Remove: created_at = String()
    # Remove: updated_at = String()
Modify CollectionManagementRequest (Lines 25-47): Remove timestamp fields:
class CollectionManagementRequest(Record):
    operation = String()
    user = String()
    collection = String()
    timestamp = String()
    name = String()
    description = String()
    tags = Array(String())
    # Remove: created_at = String()
    # Remove: updated_at = String()
    tag_filter = Array(String())
    limit = Integer()
Rationale:
- Timestamps don't add value for collections
- Config service maintains its own version tracking
- Simplifies schema and reduces storage
Benefits of Config Service Migration
- ✅ Eliminates hardcoded storage management topics - Solves multi-tenant blocker
- ✅ Simpler coordination - No complex async waiting for 4+ storage responses
- ✅ Eventual consistency - Storage services update independently via config push
- ✅ Better reliability - Persistent config push vs non-persistent request/response
- ✅ Unified configuration model - Collections treated as configuration
- ✅ Reduces complexity - Removes ~300 lines of coordination code
- ✅ Multi-tenant ready - Config already supports tenant isolation via keyspace
- ✅ Version tracking - Config service version mechanism provides audit trail
Implementation Notes
Backward Compatibility
Parameter Changes:
- CLI parameter renames are breaking changes but acceptable (feature currently non-functional)
- Services work without parameters (use defaults)
- Default keyspaces preserved: "config", "knowledge", "librarian"
- Default queue: persistent://tg/config/config
Collection Management:
- Breaking change: Collections table removed from librarian keyspace
- No data migration provided - acceptable for this phase
- External collection API unchanged (list/update/delete operations)
- Collection metadata format simplified (timestamps removed)
Testing Requirements
Parameter Testing:
1. Verify --config-push-queue parameter works on graph-embeddings service
2. Verify --config-push-queue parameter works on text-completion service
3. Verify --config-push-queue parameter works on config service
4. Verify --cassandra-keyspace parameter works for config service
5. Verify --cassandra-keyspace parameter works for cores service
6. Verify --cassandra-keyspace parameter works for librarian service
7. Verify services work without parameters (uses defaults)
8. Verify multi-tenant deployment with custom queue names and keyspace
Collection Management Testing:
9. Verify list-collections operation via config service
10. Verify update-collection creates/updates in config table
11. Verify delete-collection removes from config table
12. Verify config push is triggered on collection updates
13. Verify tag filtering works with config-based storage
14. Verify collection operations work without timestamp fields
Multi-Tenant Deployment Example
# Tenant: tg-dev
graph-embeddings \
-p pulsar+ssl://broker:6651 \
--pulsar-api-key <KEY> \
--config-push-queue persistent://tg-dev/config/config
config-service \
-p pulsar+ssl://broker:6651 \
--pulsar-api-key <KEY> \
--config-push-queue persistent://tg-dev/config/config \
--cassandra-keyspace tg_dev_config
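Per-tenant values like these could be derived from a tenant ID rather than typed by hand. The sketch below is one possible naming convention, not something the spec defines: the helper tenant_settings is hypothetical, and it replaces characters outside letters, digits, and underscore because unquoted Cassandra keyspace names do not allow hyphens.

```python
import re


def tenant_settings(tenant, service):
    """Hypothetical helper: derive queue and keyspace names for one tenant."""
    # Replace anything outside [A-Za-z0-9_] so the result is a plain keyspace name
    keyspace = re.sub(r"[^A-Za-z0-9_]", "_", f"{tenant}_{service}").lower()
    return {
        "config_push_queue": f"persistent://{tenant}/config/config",
        "cassandra_keyspace": keyspace,
    }


settings = tenant_settings("tg-dev", "config")
print(settings)
```

For the tenant tg-dev this reproduces the queue and keyspace used in the deployment example above.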
Impact Analysis
Services Affected by Changes 1-2 (CLI Parameter Rename)
All services inheriting from AsyncProcessor or FlowProcessor:
- config-service
- cores-service
- librarian-service
- graph-embeddings
- document-embeddings
- text-completion-* (all providers)
- extract-* (all extractors)
- query-* (all query services)
- retrieval-* (all RAG services)
- storage-* (all storage services)
- And 20+ more services
Services Affected by Changes 3-6 (Cassandra Keyspace)
- config-service
- cores-service
- librarian-service
Services Affected by Changes 7-11 (Collection Management)
Immediate Changes:
- librarian-service (collection_manager.py, service.py)
- tables/library.py (collections table removal)
- schema/services/collection.py (timestamp removal)
Completed Changes (Change 10): ✅
- All storage services (11 total) - migrated to config push for collection updates via CollectionConfigHandler
- Storage management schema removed from storage.py
Future Considerations
Per-User Keyspace Model
Some services use per-user keyspaces dynamically, where each user gets their own Cassandra keyspace:
Services with per-user keyspaces:
- Triples Query Service (trustgraph-flow/trustgraph/query/triples/cassandra/service.py:65)
  - Uses keyspace=query.user
- Objects Query Service (trustgraph-flow/trustgraph/query/objects/cassandra/service.py:479)
  - Uses keyspace=self.sanitize_name(user)
- KnowledgeGraph Direct Access (trustgraph-flow/trustgraph/direct/cassandra_kg.py:18)
  - Default parameter keyspace="trustgraph"
Status: These are not modified in this specification.
Future Review Required:
- Evaluate whether per-user keyspace model creates tenant isolation issues
- Consider if multi-tenant deployments need keyspace prefix patterns (e.g., tenant_a_user1)
- Review for potential user ID collision across tenants
- Assess if single shared keyspace per tenant with user-based row isolation is preferable
Note: This does not block the current multi-tenant implementation but should be reviewed before production multi-tenant deployments.
Implementation Phases
Phase 1: Parameter Fixes (Changes 1-6)
- Fix --config-push-queue parameter naming
- Add --cassandra-keyspace parameter support
- Outcome: Multi-tenant queue and keyspace configuration enabled
Phase 2: Collection Management Migration (Changes 7-9, 11)
- Migrate collection storage to config service
- Remove collections table from librarian
- Update collection schema (remove timestamps)
- Outcome: Eliminates hardcoded storage management topics, simplifies librarian
Phase 3: Storage Service Updates (Change 10) ✅ COMPLETED
- Updated all storage services to use config push for collections via CollectionConfigHandler
- Removed storage management request/response infrastructure
- Removed legacy schema definitions
- Outcome: Complete config-based collection management achieved
References
- GitHub Issue: https://github.com/trustgraph-ai/trustgraph/issues/582
- Related Files:
  - trustgraph-base/trustgraph/base/async_processor.py
  - trustgraph-base/trustgraph/base/cassandra_config.py
  - trustgraph-base/trustgraph/schema/core/topic.py
  - trustgraph-base/trustgraph/schema/services/collection.py
  - trustgraph-flow/trustgraph/config/service/service.py
  - trustgraph-flow/trustgraph/cores/service.py
  - trustgraph-flow/trustgraph/librarian/service.py
  - trustgraph-flow/trustgraph/librarian/collection_manager.py
  - trustgraph-flow/trustgraph/tables/library.py