| layout | title | parent |
|---|---|---|
| default | Technical Specification: Multi-Tenant Support | Chinese (Beta) |
Technical Specification: Multi-Tenant Support
Beta Translation: This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.
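Overview
Enable multi-tenant deployments by fixing the parameter name mismatches that prevent queue customization and by adding Cassandra keyspace parameterization.
Architecture Context
Flow-Based Queue Resolution
TrustGraph uses a flow-based architecture for dynamic queue resolution, which inherently supports multi-tenancy:
Flow definitions are stored in Cassandra and specify queue names through interface definitions.
Queue names use templates containing an {id} variable, which is replaced with the flow instance ID.
Services resolve queues dynamically at request time by looking up the flow configuration.
Each tenant can have its own flows with distinct queue names, which provides isolation.
Example flow interface definition:
{
  "interfaces": {
    "triples-store": "persistent://tg/flow/triples-store:{id}",
    "graph-embeddings-store": "persistent://tg/flow/graph-embeddings-store:{id}"
  }
}
When tenant A starts flow tenant-a-prod and tenant B starts flow tenant-b-prod, they automatically get isolated queues: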
persistent://tg/flow/triples-store:tenant-a-prod
persistent://tg/flow/triples-store:tenant-b-prod
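The template substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not TrustGraph's actual resolver: the function name resolve_queues is hypothetical, and plain string replacement is assumed for the {id} placeholder.

```python
def resolve_queues(interfaces: dict[str, str], flow_id: str) -> dict[str, str]:
    """Replace the {id} placeholder in each queue template with the flow ID."""
    return {
        name: template.replace("{id}", flow_id)
        for name, template in interfaces.items()
    }


# Interface templates copied from the example flow definition
interfaces = {
    "triples-store": "persistent://tg/flow/triples-store:{id}",
    "graph-embeddings-store": "persistent://tg/flow/graph-embeddings-store:{id}",
}

tenant_a = resolve_queues(interfaces, "tenant-a-prod")
tenant_b = resolve_queues(interfaces, "tenant-b-prod")
```

Because the flow ID is part of every resolved name, the two tenants never share a queue as long as their flow IDs differ.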
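Services already designed for multi-tenancy:
✅ Knowledge Management (cores) - dynamically resolves queues from the flow configuration passed in the request
Services that need fixes:
🔴 Config Service - parameter name mismatch prevents queue customization
🔴 Librarian Service - hardcoded storage management topics (see below)
🔴 All services - the Cassandra keyspace cannot be customized
Problem Statement
Issue #1: Parameter Name Mismatch in AsyncProcessor
CLI defines: --config-queue (ambiguous naming)
Argparse converts to: config_queue (in the params dict)
Code looks for: config_push_queue
Result: the parameter is ignored and the default persistent://tg/config/config is used
Impact: affects all 32+ services that inherit from AsyncProcessor
Blocks: multi-tenant deployments cannot use tenant-specific config queues
Solution: rename the CLI parameter to --config-push-queue for clarity (a breaking change is acceptable because the feature is currently broken)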
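The mismatch in Issue #1 can be reproduced with argparse alone. The sketch below is a standalone illustration, not the AsyncProcessor code: parse_params is a hypothetical helper, and only the flag names, the lookup key, and the default queue come from this specification.

```python
import argparse

DEFAULT_QUEUE = "persistent://tg/config/config"


def parse_params(flag: str, argv: list[str]) -> dict:
    """Parse argv with a single queue flag and return the params dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument(flag, default=DEFAULT_QUEUE)
    return vars(parser.parse_args(argv))


custom = "persistent://tg-dev/config/config"

# Before the fix: argparse stores the value under "config_queue",
# so a lookup of "config_push_queue" silently falls back to the default.
before = parse_params("--config-queue", ["--config-queue", custom])
queue_before = before.get("config_push_queue", DEFAULT_QUEUE)

# After the fix: the flag name and the lookup key agree.
after = parse_params("--config-push-queue", ["--config-push-queue", custom])
queue_after = after.get("config_push_queue", DEFAULT_QUEUE)
```

No error is raised in the first case, which is why the problem shows up only as a custom queue that never takes effect.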
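Issue #2: Parameter Name Mismatch in the Config Service
CLI defines: --push-queue (ambiguous naming)
Argparse converts to: push_queue (in the params dict)
Code looks for: config_push_queue
Result: the parameter is ignored
Impact: the config service cannot use a custom push queue
Solution: rename the CLI parameter to --config-push-queue for consistency and clarity (a breaking change is acceptable)
Issue #3: Hardcoded Cassandra Keyspace
Current: the keyspace is hardcoded as "config", "knowledge", or "librarian" in the respective services
Result: the keyspace cannot be customized for multi-tenant deployments
Impact: config, cores, and librarian services
Blocks: multiple tenants cannot use separate Cassandra keyspaces
Issue #4: Collection Management Architecture ✅ Completed
Previous: collections were stored in a separate collections table in the librarian keyspace
Previous: the librarian used 4 hardcoded storage management topics to coordinate collection creation and deletion: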
vector_storage_management_topic
object_storage_management_topic
triples_storage_management_topic
storage_management_response_topic
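Problems (resolved):
Hardcoded topics could not be customized for multi-tenant deployments
Complex asynchronous coordination between the librarian and 4 or more storage services
Separate Cassandra table and management infrastructure
Non-persistent request/response queues for critical operations
Solution implemented: collections migrated to config service storage and distributed via config push
Status: all storage backends have been migrated to the CollectionConfigHandler pattern
Solution
This specification addresses Issues #1, #2, #3, and #4.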
Part 1: Fix Parameter Name Mismatches
Change 1: AsyncProcessor Base Class - Rename CLI Parameter
File: trustgraph-base/trustgraph/base/async_processor.py
Lines: 260-264
Current:
parser.add_argument(
    '--config-queue',
    default=default_config_queue,
    help=f'Config push queue {default_config_queue}',
)
Fixed:
parser.add_argument(
    '--config-push-queue',
    default=default_config_queue,
    help=f'Config push queue (default: {default_config_queue})',
)
Rationale:
Clearer, more explicit naming
Matches the internal variable name config_push_queue
A breaking change is acceptable because the feature is currently non-functional
No code change is needed in params.get(); it already looks for the correct name
Change 2: Config Service - Rename CLI Parameter
File: trustgraph-flow/trustgraph/config/service/service.py
Lines: 276-279
Current:
parser.add_argument(
    '--push-queue',
    default=default_config_push_queue,
    help=f'Config push queue (default: {default_config_push_queue})'
)
Fixed:
parser.add_argument(
    '--config-push-queue',
    default=default_config_push_queue,
    help=f'Config push queue (default: {default_config_push_queue})'
)
Rationale:
Clearer naming: "config-push-queue" is more explicit than just "push-queue".
Matches the internal variable name config_push_queue.
Consistent with AsyncProcessor's --config-push-queue parameter.
A breaking change is acceptable because the feature is currently non-functional.
No code change is needed in params.get(); it already looks for the correct name.
Part 2: Add Cassandra Keyspace Parameterization
Change 3: Add Keyspace Parameter to the cassandra_config Module
File: trustgraph-base/trustgraph/base/cassandra_config.py
Add a CLI argument (in the add_cassandra_args() function):
parser.add_argument(
    '--cassandra-keyspace',
    default=None,
    help='Cassandra keyspace (default: service-specific)'
)
Add environment variable support (in the resolve_cassandra_config() function):
keyspace = params.get(
    "cassandra_keyspace",
    os.environ.get("CASSANDRA_KEYSPACE")
)
Update the return value of resolve_cassandra_config():
Currently returns: (hosts, username, password)
Change to return: (hosts, username, password, keyspace)
Rationale:
Consistent with the existing Cassandra configuration pattern
Available to all services via add_cassandra_args()
Supports both CLI and environment variable configuration
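One way to combine the three keyspace sources is sketched below. It is an assumption-laden illustration rather than the real resolve_cassandra_config(): the helper name resolve_keyspace and its environ argument are hypothetical, and the precedence (CLI value, then CASSANDRA_KEYSPACE, then the service default passed as default_keyspace) is inferred from the snippets in this specification. The sketch also assumes params may contain the key with a None value, which is what argparse produces for default=None; a plain params.get(key, fallback) would not reach the environment variable in that case.

```python
import os


def resolve_keyspace(params, default_keyspace, environ=None):
    """Pick the keyspace: CLI value, then environment, then service default."""
    environ = os.environ if environ is None else environ
    return (
        params.get("cassandra_keyspace")        # None or missing falls through
        or environ.get("CASSANDRA_KEYSPACE")    # unset or empty falls through
        or default_keyspace
    )


from_cli = resolve_keyspace(
    {"cassandra_keyspace": "tg_dev_config"}, "config",
    environ={"CASSANDRA_KEYSPACE": "ignored"},
)
from_env = resolve_keyspace(
    {"cassandra_keyspace": None}, "config",
    environ={"CASSANDRA_KEYSPACE": "env_keyspace"},
)
from_default = resolve_keyspace({}, "config", environ={})
```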
Change 4: Config Service - Use Parameterized Keyspace
File: trustgraph-flow/trustgraph/config/service/service.py
Line 30 - Remove the hardcoded keyspace:
# DELETE THIS LINE:
keyspace = "config"
Lines 69-73 - Update Cassandra config resolution:
Current:
cassandra_host, cassandra_username, cassandra_password = \
    resolve_cassandra_config(params)
Fixed:
cassandra_host, cassandra_username, cassandra_password, keyspace = \
    resolve_cassandra_config(params, default_keyspace="config")
Rationale:
Maintains backward compatibility by keeping "config" as the default.
Allows override via --cassandra-keyspace or CASSANDRA_KEYSPACE.
Change 5: Cores/Knowledge Service - Use Parameterized Keyspace
File: trustgraph-flow/trustgraph/cores/service.py
Line 37 - Remove the hardcoded keyspace:
# DELETE THIS LINE:
keyspace = "knowledge"
Update Cassandra config resolution (similar location to the config service):
cassandra_host, cassandra_username, cassandra_password, keyspace = \
    resolve_cassandra_config(params, default_keyspace="knowledge")
Change 6: Librarian Service - Use Parameterized Keyspace
File: trustgraph-flow/trustgraph/librarian/service.py
Line 51 - Remove the hardcoded keyspace:
# DELETE THIS LINE:
keyspace = "librarian"
Update Cassandra config resolution (similar location to the config service):
cassandra_host, cassandra_username, cassandra_password, keyspace = \
    resolve_cassandra_config(params, default_keyspace="librarian")
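Part 3: Migrate Collection Management to the Config Service
Overview
Migrate collections from the Cassandra librarian keyspace to config service storage. This eliminates the hardcoded storage management topics and simplifies the architecture by using the existing config push mechanism for distribution.
Current Architecture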
API Request → Gateway → Librarian Service
↓
CollectionManager
↓
Cassandra Collections Table (librarian keyspace)
↓
Broadcast to 4 Storage Management Topics (hardcoded)
↓
Wait for 4+ Storage Service Responses
↓
Response to Gateway
New Architecture
API Request → Gateway → Librarian Service
↓
CollectionManager
↓
Config Service API (put/delete/getvalues)
↓
Cassandra Config Table (class='collections', key='user:collection')
↓
Config Push (to all subscribers on config-push-queue)
↓
All Storage Services receive config update independently
Change 7: Collection Manager - Use the Config Service API
File: trustgraph-flow/trustgraph/librarian/collection_manager.py
Remove:
LibraryTableStore usage (line 33, lines 40-41)
Storage management producer initialization (lines 86-140)
on_storage_response method (lines 400-430)
pending_deletions tracking (line 57, lines 90-96, and all usage throughout)
Add: a config service client for API calls (request/response pattern)
Config client setup:
# In __init__, add config request/response producers/consumers
from trustgraph.schema.services.config import ConfigRequest, ConfigResponse

# Producer for config requests
self.config_request_producer = Producer(
    client=pulsar_client,
    topic=config_request_queue,
    schema=ConfigRequest,
)

# Consumer for config responses (with correlation ID)
self.config_response_consumer = Consumer(
    taskgroup=taskgroup,
    client=pulsar_client,
    flow=None,
    topic=config_response_queue,
    subscriber=f"{id}-config",
    schema=ConfigResponse,
    handler=self.on_config_response,
)

# Tracking for pending config requests
self.pending_config_requests = {}  # request_id -> asyncio.Event
Modify list_collections (lines 145-180):
# Requires module-level imports: asyncio, json, uuid
async def list_collections(self, user, tag_filter=None, limit=None):
    """List collections from config service"""
    # Send getvalues request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='getvalues',
        type='collections',
    )

    # Send request and wait for response
    response = await self.send_config_request(request)

    # Parse collections from response; keys have the form "user:collection"
    collections = []
    for key, value_json in response.values.items():
        if ":" in key:
            coll_user, collection = key.split(":", 1)
            if coll_user == user:
                metadata = json.loads(value_json)
                collections.append(CollectionMetadata(**metadata))

    # Apply tag filtering in-memory (as before)
    if tag_filter:
        collections = [c for c in collections if any(tag in c.tags for tag in tag_filter)]

    # Apply limit
    if limit:
        collections = collections[:limit]

    return collections

async def send_config_request(self, request):
    """Send config request and wait for response"""
    event = asyncio.Event()
    self.pending_config_requests[request.id] = event
    await self.config_request_producer.send(request)
    await event.wait()
    # Drop the event entry so completed requests do not accumulate
    self.pending_config_requests.pop(request.id, None)
    return self.pending_config_requests.pop(request.id + "_response")

async def on_config_response(self, message, consumer, flow):
    """Handle config response"""
    response = message.value()
    if response.id in self.pending_config_requests:
        # Store the response first, then wake the waiting request
        self.pending_config_requests[response.id + "_response"] = response
        self.pending_config_requests[response.id].set()
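The request/response correlation used above can also be expressed with one future per request, which keeps the response and the wake-up signal in a single object and makes cleanup and timeouts straightforward. The following is a self-contained sketch under stated assumptions: RequestCorrelator, fake_send, and the 5-second timeout are hypothetical and stand in for the Pulsar producer and consumer; nothing here is taken from the TrustGraph code base.

```python
import asyncio
import uuid


class RequestCorrelator:
    """Match responses to in-flight requests by ID (illustrative only)."""

    def __init__(self, send):
        self._send = send      # async callable that publishes a request
        self._pending = {}     # request_id -> asyncio.Future

    async def request(self, payload, timeout=5.0):
        request_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            await self._send(request_id, payload)
            return await asyncio.wait_for(future, timeout)
        finally:
            # Always clean up, including on timeout or send failure
            self._pending.pop(request_id, None)

    def on_response(self, request_id, response):
        future = self._pending.get(request_id)
        if future is not None and not future.done():
            future.set_result(response)


async def demo():
    correlator = None

    async def fake_send(request_id, payload):
        # Stand-in for the broker: deliver an echo response on the next loop tick
        asyncio.get_running_loop().call_soon(
            correlator.on_response, request_id, {"echo": payload}
        )

    correlator = RequestCorrelator(fake_send)
    result = await correlator.request({"operation": "getvalues"})
    return result, len(correlator._pending)


result, pending_left = asyncio.run(demo())
```

The timeout matters in practice: without one, a lost response would leave the caller waiting indefinitely.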
Modify update_collection (lines 182-312):
async def update_collection(self, user, collection, name, description, tags):
    """Update collection via config service"""
    # Create metadata
    metadata = CollectionMetadata(
        user=user,
        collection=collection,
        name=name,
        description=description,
        tags=tags,
    )

    # Send put request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='put',
        type='collections',
        key=f'{user}:{collection}',
        value=json.dumps(metadata.to_dict()),
    )
    response = await self.send_config_request(request)
    if response.error:
        raise RuntimeError(f"Config update failed: {response.error.message}")

    # Config service will trigger config push automatically
    # Storage services will receive update and create collections
Modify delete_collection (lines 314-398):
async def delete_collection(self, user, collection):
    """Delete collection via config service"""
    # Send delete request to config service
    request = ConfigRequest(
        id=str(uuid.uuid4()),
        operation='delete',
        type='collections',
        key=f'{user}:{collection}',
    )
    response = await self.send_config_request(request)
    if response.error:
        raise RuntimeError(f"Config delete failed: {response.error.message}")

    # Config service will trigger config push automatically
    # Storage services will receive update and delete collections
Collection metadata format:
Stored in the config table as: class='collections', key='user:collection'
The value is a JSON-serialized CollectionMetadata (without timestamp fields)
Fields: user, collection, name, description, tags
Example: class='collections', key='alice:my-docs', value='{"user":"alice","collection":"my-docs","name":"My Documents","description":"...","tags":["work"]}'
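The key and value layout above can be exercised with a short round-trip sketch. The helpers encode_entry and decode_key are hypothetical names, and a plain dict stands in for CollectionMetadata. Splitting on the first ":" only, as list_collections does, implies that a collection name may contain ":" but a user ID may not; that constraint is an inference from the code, not something this specification states.

```python
import json


def encode_entry(user, collection, name, description, tags):
    """Build the (key, value) pair stored under class='collections'."""
    key = f"{user}:{collection}"
    value = json.dumps({
        "user": user,
        "collection": collection,
        "name": name,
        "description": description,
        "tags": tags,
    })
    return key, value


def decode_key(key):
    """Split a stored key on the first ':' into (user, collection)."""
    user, collection = key.split(":", 1)
    return user, collection


key, value = encode_entry("alice", "my-docs", "My Documents", "...", ["work"])
metadata = json.loads(value)
```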
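Change 8: Librarian Service - Remove Storage Management Infrastructure
File: trustgraph-flow/trustgraph/librarian/service.py
Remove:
Storage management producers (lines 173-190):
vector_storage_management_producer
object_storage_management_producer
triples_storage_management_producer
Storage response consumer (lines 192-201)
on_storage_response handler (lines 467-473)
Modify: CollectionManager initialization (lines 215-224) - remove the storage producer parameters
Note: the external collection API remains unchanged:
list-collections
update-collection
delete-collection
Change 9: Remove the Collections Table from LibraryTableStore
File: trustgraph-flow/trustgraph/tables/library.py
Remove:
Collections table CREATE statement (lines 114-127)
Collections prepared statements (lines 205-240)
All collection methods (lines 578-717):
ensure_collection_exists
list_collections
update_collection
delete_collection
get_collection
create_collection
Rationale:
Collections are now stored in the config table
This is a breaking change, but no data migration is needed
Significantly simplifies the librarian service
Change 10: Storage Services - Config-Based Collection Management ✅ Completed
Status: all 11 storage backends have been migrated to use CollectionConfigHandler.
Affected services (11 total):
Document embeddings: milvus, pinecone, qdrant
Graph embeddings: milvus, pinecone, qdrant
Object storage: cassandra
Triples storage: cassandra, falkordb, memgraph, neo4j
Files: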
trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py
trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py
trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py
trustgraph-flow/trustgraph/storage/graph_embeddings/milvus/write.py
trustgraph-flow/trustgraph/storage/graph_embeddings/pinecone/write.py
trustgraph-flow/trustgraph/storage/graph_embeddings/qdrant/write.py
trustgraph-flow/trustgraph/storage/objects/cassandra/write.py
trustgraph-flow/trustgraph/storage/triples/cassandra/write.py
trustgraph-flow/trustgraph/storage/triples/falkordb/write.py
trustgraph-flow/trustgraph/storage/triples/memgraph/write.py
trustgraph-flow/trustgraph/storage/triples/neo4j/write.py
Implementation pattern (all services):
- Register the config handler in __init__:
# Add after AsyncProcessor initialization
self.register_config_handler(self.on_collection_config)
self.known_collections = set()  # Track (user, collection) tuples
- Implement the config handler:
async def on_collection_config(self, config, version):
    """Handle collection configuration updates"""
    logger.info(f"Collection config version: {version}")

    if "collections" not in config:
        return

    # Parse collections from config
    # Key format: "user:collection" in config["collections"]
    config_collections = set()
    for key in config["collections"].keys():
        if ":" in key:
            user, collection = key.split(":", 1)
            config_collections.add((user, collection))

    # Determine changes
    to_create = config_collections - self.known_collections
    to_delete = self.known_collections - config_collections

    # Create new collections (idempotent)
    for user, collection in to_create:
        try:
            await self.create_collection_internal(user, collection)
            self.known_collections.add((user, collection))
            logger.info(f"Created collection: {user}/{collection}")
        except Exception as e:
            logger.error(f"Failed to create {user}/{collection}: {e}")

    # Delete removed collections (idempotent)
    for user, collection in to_delete:
        try:
            await self.delete_collection_internal(user, collection)
            self.known_collections.discard((user, collection))
            logger.info(f"Deleted collection: {user}/{collection}")
        except Exception as e:
            logger.error(f"Failed to delete {user}/{collection}: {e}")
- Initialize known collections at startup:
async def start(self):
    """Start the processor"""
    await super().start()
    await self.sync_known_collections()

async def sync_known_collections(self):
    """Query backend to populate known_collections set"""
    # Backend-specific implementation:
    # - Milvus/Pinecone/Qdrant: List collections/indexes matching naming pattern
    # - Cassandra: Query keyspaces or collection metadata
    # - Neo4j/Memgraph/FalkorDB: Query CollectionMetadata nodes
    pass
- Refactor the existing handler methods:
# Rename and remove response sending:
# handle_create_collection → create_collection_internal
# handle_delete_collection → delete_collection_internal
async def create_collection_internal(self, user, collection):
    """Create collection (idempotent)"""
    # Same logic as current handle_create_collection
    # But remove response producer calls
    # Handle "already exists" gracefully
    pass

async def delete_collection_internal(self, user, collection):
    """Delete collection (idempotent)"""
    # Same logic as current handle_delete_collection
    # But remove response producer calls
    # Handle "not found" gracefully
    pass
- Remove the storage management infrastructure:
Remove the self.storage_request_consumer setup and startup
Remove the self.storage_response_producer setup
Remove the on_storage_management dispatcher method
Remove the storage management metrics
Remove imports: StorageManagementRequest, StorageManagementResponse
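Backend-specific considerations:
Vector stores (Milvus, Pinecone, Qdrant): track the logical (user, collection) pair in known_collections, even though several backend collections may be created, one per dimension. Continue the lazy-creation pattern. Delete operations must remove all dimension variants.
Cassandra Objects: collections are a row property, not a structure. Track keyspace-level information.
Graph stores (Neo4j, Memgraph, FalkorDB): query CollectionMetadata nodes at startup. Create and delete metadata nodes on sync.
Cassandra Triples: use the KnowledgeGraph API for collection operations.
Key design points:
Eventual consistency: there is no request/response mechanism; config push is broadcast
Idempotency: all create and delete operations must be safe to retry
Error handling: log errors, but do not block config updates
Self-healing: failed operations are retried on the next config push
Collection key format: "user:collection" in config["collections"]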
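The self-healing property follows from how the handler updates its known set: a collection is added or removed only after the backend call succeeds, so a failed operation reappears in the set difference on the next push. The sketch below isolates that logic with in-memory stand-ins. The reconcile function and the fake create and delete callables are hypothetical, not part of any storage service.

```python
import asyncio


async def reconcile(known, desired, create, delete):
    """Move `known` toward `desired`; failures are left for the next push."""
    for item in sorted(desired - known):
        try:
            await create(*item)
            known.add(item)          # recorded only after success
        except Exception:
            pass                     # still missing, so retried next time
    for item in sorted(known - desired):
        try:
            await delete(*item)
            known.discard(item)
        except Exception:
            pass                     # still present, so retried next time


async def demo():
    known = set()
    attempts = {"create": 0}

    async def create(user, collection):
        attempts["create"] += 1
        if attempts["create"] == 1:
            raise RuntimeError("backend temporarily unavailable")

    async def delete(user, collection):
        return None

    desired = {("alice", "docs")}
    await reconcile(known, desired, create, delete)   # first push: create fails
    after_first = set(known)
    await reconcile(known, desired, create, delete)   # second push: retry succeeds
    after_second = set(known)
    await reconcile(known, set(), create, delete)     # collection removed from config
    return after_first, after_second, set(known)


after_first, after_second, after_removal = asyncio.run(demo())
```

A retry happens only when another config push arrives, so a failure may persist until the next configuration change.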
Change 11: Update Collection Schema - Remove Timestamps
File: trustgraph-base/trustgraph/schema/services/collection.py
Modify CollectionMetadata (lines 13-21):
Remove the created_at and updated_at fields:
class CollectionMetadata(Record):
    user = String()
    collection = String()
    name = String()
    description = String()
    tags = Array(String())
    # Remove: created_at = String()
    # Remove: updated_at = String()
Modify CollectionManagementRequest (lines 25-47):
Remove the timestamp fields:
class CollectionManagementRequest(Record):
    operation = String()
    user = String()
    collection = String()
    timestamp = String()
    name = String()
    description = String()
    tags = Array(String())
    # Remove: created_at = String()
    # Remove: updated_at = String()
    tag_filter = Array(String())
    limit = Integer()
Rationale:
Timestamps don't add value for collections
Config service maintains its own version tracking
Simplifies schema and reduces storage
Benefits of Config Service Migration
- ✅ Eliminates hardcoded storage management topics - Solves multi-tenant blocker
- ✅ Simpler coordination - No complex async waiting for 4+ storage responses
- ✅ Eventual consistency - Storage services update independently via config push
- ✅ Better reliability - Persistent config push vs non-persistent request/response
- ✅ Unified configuration model - Collections treated as configuration
- ✅ Reduces complexity - Removes ~300 lines of coordination code
- ✅ Multi-tenant ready - Config already supports tenant isolation via keyspace
- ✅ Version tracking - Config service version mechanism provides audit trail
Implementation Notes
Backward Compatibility
Parameter Changes:
CLI parameter renames are breaking changes but acceptable (feature currently non-functional)
Services work without parameters (use defaults)
Default keyspaces preserved: "config", "knowledge", "librarian"
Default queue: persistent://tg/config/config
Collection Management:
Breaking change: Collections table removed from librarian keyspace
No data migration provided - acceptable for this phase
External collection API unchanged (list/update/delete operations)
Collection metadata format simplified (timestamps removed)
Testing Requirements
Parameter Testing:
1. Verify --config-push-queue parameter works on graph-embeddings service
2. Verify --config-push-queue parameter works on text-completion service
3. Verify --config-push-queue parameter works on config service
4. Verify --cassandra-keyspace parameter works for config service
5. Verify --cassandra-keyspace parameter works for cores service
6. Verify --cassandra-keyspace parameter works for librarian service
7. Verify services work without parameters (uses defaults)
8. Verify multi-tenant deployment with custom queue names and keyspace
Collection Management Testing:
9. Verify list-collections operation via config service
10. Verify update-collection creates/updates in config table
11. Verify delete-collection removes from config table
12. Verify config push is triggered on collection updates
13. Verify tag filtering works with config-based storage
14. Verify collection operations work without timestamp fields
Multi-Tenant Deployment Example
# Tenant: tg-dev
graph-embeddings \
  -p pulsar+ssl://broker:6651 \
  --pulsar-api-key <KEY> \
  --config-push-queue persistent://tg-dev/config/config

config-service \
  -p pulsar+ssl://broker:6651 \
  --pulsar-api-key <KEY> \
  --config-push-queue persistent://tg-dev/config/config \
  --cassandra-keyspace tg_dev_config
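When many tenants are deployed, per-tenant values like those in the example could be generated rather than typed by hand. The sketch below assumes a naming convention inferred only from this one example (the tenant name is used as the Pulsar tenant in the queue URL, and the keyspace is the tenant and service names joined with underscores); the helpers tenant_settings and to_cli_args are hypothetical and not part of TrustGraph.

```python
def tenant_settings(tenant: str, service: str) -> dict[str, str]:
    """Derive per-tenant parameter values (assumed naming convention)."""
    # Hyphens are replaced because unquoted Cassandra keyspace names
    # only allow letters, digits, and underscores.
    keyspace = f"{tenant}_{service}".replace("-", "_")
    return {
        "config_push_queue": f"persistent://{tenant}/config/config",
        "cassandra_keyspace": keyspace,
    }


def to_cli_args(settings: dict[str, str]) -> list[str]:
    """Turn the settings into --flag value pairs."""
    args = []
    for name, value in settings.items():
        args += ["--" + name.replace("_", "-"), value]
    return args


settings = tenant_settings("tg-dev", "config")
cli_args = to_cli_args(settings)
```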
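Impact Analysis
Services Affected by Changes 1-2 (CLI Parameter Rename)
All services inheriting from AsyncProcessor or FlowProcessor:
config-service
cores-service
librarian-service
graph-embeddings
document-embeddings
text-completion-* (all providers)
extract-* (all extractors)
query-* (all query services)
retrieval-* (all RAG services)
storage-* (all storage services)
And 20+ more services
Services Affected by Changes 3-6 (Cassandra Keyspace)
config-service
cores-service
librarian-service
Services Affected by Changes 7-11 (Collection Management)
Immediate changes:
librarian-service (collection_manager.py, service.py)
tables/library.py (collections table removed)
schema/services/collection.py (timestamps removed)
Completed changes (Change 10): ✅
All storage services (11 total) - migrated to config push for collection updates via CollectionConfigHandler
Storage management schema removed from storage.py
Future Considerations
Per-User Keyspace Pattern
Some services use a dynamic per-user keyspace pattern, in which each user gets their own Cassandra keyspace:
Services using per-user keyspaces:
- Triples query service (trustgraph-flow/trustgraph/query/triples/cassandra/service.py:65) uses keyspace=query.user
- Objects query service (trustgraph-flow/trustgraph/query/objects/cassandra/service.py:479) uses keyspace=self.sanitize_name(user)
- Knowledge graph direct access (trustgraph-flow/trustgraph/direct/cassandra_kg.py:18) has the default parameter keyspace="trustgraph"
Status: these are not modified in this specification.
Future review required:
Evaluate whether the per-user keyspace pattern creates tenant isolation issues
Consider whether multi-tenant deployments need a keyspace prefix pattern (e.g., tenant_a_user1)
Review for potential user ID collisions across tenants
Assess whether a single shared keyspace per tenant with user-based row isolation is preferable
Note: this does not block the current multi-tenant implementation, but it should be reviewed before production multi-tenant deployments.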
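To make the prefix option concrete, here is a minimal sketch of how a tenant-prefixed keyspace name could be built. It is purely illustrative: tenant_user_keyspace is a hypothetical helper, the sanitization rule is an assumption, and this specification does not select any of the options listed above.

```python
import re


def tenant_user_keyspace(tenant: str, user: str) -> str:
    """Build a keyspace name of the form <tenant>_<user>."""
    def clean(part: str) -> str:
        # Lowercase and replace anything outside [a-z0-9_] with "_"
        return re.sub(r"[^a-z0-9_]", "_", part.lower())
    return f"{clean(tenant)}_{clean(user)}"


# The same user ID in two tenants maps to two different keyspaces
keyspace_a = tenant_user_keyspace("tenant_a", "user1")
keyspace_b = tenant_user_keyspace("tenant_b", "user1")
```

Simple concatenation is not collision-free on its own: tenant "a" with user "b_c" and tenant "a_b" with user "c" produce the same name, so a real scheme would need a stricter separator rule or restrictions on tenant names.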
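Implementation Phases
Phase 1: Parameter Fixes (Changes 1-6)
Fix the --config-push-queue parameter naming
Add --cassandra-keyspace parameter support
Outcome: multi-tenant queue and keyspace configuration enabled
Phase 2: Collection Management Migration (Changes 7-9, 11)
Migrate collection storage to the config service
Remove the collections table from the librarian
Update the collection schema (remove timestamps)
Outcome: eliminates hardcoded storage management topics and simplifies the librarian
Phase 3: Storage Service Updates (Change 10) ✅ Completed
All storage services updated to use config push for collection management via CollectionConfigHandler
Removed the storage management request/response infrastructure
Removed the legacy schema definitions
Outcome: config-based collection management achieved
References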
GitHub Issue: https://github.com/trustgraph-ai/trustgraph/issues/582
Related files:
trustgraph-base/trustgraph/base/async_processor.py
trustgraph-base/trustgraph/base/cassandra_config.py
trustgraph-base/trustgraph/schema/core/topic.py
trustgraph-base/trustgraph/schema/services/collection.py
trustgraph-flow/trustgraph/config/service/service.py
trustgraph-flow/trustgraph/cores/service.py
trustgraph-flow/trustgraph/librarian/service.py
trustgraph-flow/trustgraph/librarian/collection_manager.py
trustgraph-flow/trustgraph/tables/library.py