---
layout: default
title: "Document Embeddings Chunk ID"
parent: "Chinese (Beta)"
---

# Document Embeddings Chunk ID

> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.

## Overview

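Currently, the document embeddings store keeps chunk text directly in the vector store payload, which duplicates data that already exists in Garage. This specification replaces the stored chunk text with a reference to `chunk_id`.

## Current State
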
```python
from dataclasses import dataclass, field

@dataclass
class ChunkEmbeddings:
    chunk: bytes = b""  # raw chunk text carried alongside the vectors
    vectors: list[list[float]] = field(default_factory=list)

@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None  # Error: existing schema type, not shown here
    chunks: list[str] = field(default_factory=list)  # chunk text returned to callers
```

Vector store payload:
```python
payload={"doc": chunk}  # Duplicates Garage content
```

## Design

### Schema Changes

**ChunkEmbeddings** - replace `chunk` with `chunk_id`:
```python
@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)
```

**DocumentEmbeddingsResponse** - return `chunk_ids` instead of `chunks`:
```python
@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunk_ids: list[str] = field(default_factory=list)
```

### Vector Store Payload

All stores (Qdrant, Milvus, Pinecone):
```python
payload={"chunk_id": chunk_id}
```

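As a rough illustration of the payload change, the sketch below builds one store-agnostic record per vector, each carrying only the `chunk_id` reference. The `build_records` helper, the record layout, the UUID point IDs, and the `doc-1/chunk-0` ID format are all assumptions made for this example; they are not taken from the TrustGraph writers.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)


def build_records(emb: ChunkEmbeddings) -> list[dict]:
    """Hypothetical helper: build one store-agnostic record per vector."""
    return [
        {
            "id": str(uuid.uuid4()),  # point ID scheme is an assumption
            "vector": vec,
            "payload": {"chunk_id": emb.chunk_id},  # reference only, no chunk text
        }
        for vec in emb.vectors
    ]


# The chunk ID format used here is illustrative only.
records = build_records(
    ChunkEmbeddings(chunk_id="doc-1/chunk-0", vectors=[[0.1, 0.2], [0.3, 0.4]])
)
```

Every vector for a chunk points back to the same `chunk_id`, so the chunk text itself never enters the vector store.
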
### Document RAG Changes

The Document RAG processor fetches chunk content from Garage through the librarian client:

```python
# Get chunk_ids from embeddings store
chunk_ids = await self.rag.doc_embeddings_client.query(...)

# Fetch chunk content from Garage
docs = []
for chunk_id in chunk_ids:
    content = await self.rag.librarian_client.get_document_content(
        chunk_id, self.user
    )
    docs.append(content)
```

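The retrieval flow can be exercised end to end with in-memory stand-ins. The sketch below is a variation rather than something the specification prescribes: it issues the content fetches concurrently with `asyncio.gather` instead of awaiting them one at a time. The `Fake*` classes, the `query` parameters, the user name, and the chunk ID format are assumptions made for illustration.

```python
import asyncio


class FakeEmbeddingsClient:
    """Stand-in for doc_embeddings_client: returns chunk IDs, not text."""

    async def query(self, vectors, limit=10):  # parameters are assumed
        return ["doc-1/chunk-0", "doc-1/chunk-1"][:limit]


class FakeLibrarianClient:
    """Stand-in for librarian_client, backed by a dict instead of Garage."""

    def __init__(self, blobs):
        self.blobs = blobs

    async def get_document_content(self, chunk_id, user):
        return self.blobs[chunk_id]


async def fetch_docs(embeddings, librarian, vectors, user):
    # Step 1: the embeddings store now returns references only
    chunk_ids = await embeddings.query(vectors)
    # Step 2: resolve each reference; gather() preserves argument order,
    # so docs[i] is the content for chunk_ids[i]
    docs = await asyncio.gather(
        *(librarian.get_document_content(cid, user) for cid in chunk_ids)
    )
    return chunk_ids, list(docs)


librarian = FakeLibrarianClient({
    "doc-1/chunk-0": "first chunk text",
    "doc-1/chunk-1": "second chunk text",
})
chunk_ids, docs = asyncio.run(
    fetch_docs(FakeEmbeddingsClient(), librarian, [[0.1, 0.2]], "alice")
)
```

Keeping `chunk_ids` alongside `docs` lets the caller attribute each piece of retrieved text to its source chunk.
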
### API/SDK Changes

**DocumentEmbeddingsClient** returns `chunk_ids`:
```python
return resp.chunk_ids  # Changed from resp.chunks
```

**Data format** (DocumentEmbeddingsResponseTranslator):
```python
result["chunk_ids"] = obj.chunk_ids  # Changed from chunks
```

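To make the resulting data format concrete, here is a minimal sketch of the response-to-dict step. The `from_response` function is a hypothetical stand-in for the translator, and the `error` field is omitted for brevity; only the `chunk_ids` key comes from the specification.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DocumentEmbeddingsResponse:
    # error field omitted in this sketch
    chunk_ids: list[str] = field(default_factory=list)


def from_response(obj: DocumentEmbeddingsResponse) -> dict:
    """Hypothetical stand-in for the translator's response-to-dict step."""
    result = {}
    result["chunk_ids"] = list(obj.chunk_ids)  # formerly result["chunks"]
    return result


# The chunk ID format used here is illustrative only.
wire = from_response(DocumentEmbeddingsResponse(chunk_ids=["doc-1/chunk-0"]))
encoded = json.dumps(wire)
```

Consumers that previously read a `chunks` key would need to switch to `chunk_ids` and fetch the text separately.
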
### CLI Changes

The CLI tools display `chunk_ids`; callers can fetch the content separately if needed.

## Files to Modify

### Schema
- `trustgraph-base/trustgraph/schema/knowledge/embeddings.py` - ChunkEmbeddings
- `trustgraph-base/trustgraph/schema/services/query.py` - DocumentEmbeddingsResponse

### Messaging/Translators
- `trustgraph-base/trustgraph/messaging/translators/embeddings_query.py` - DocumentEmbeddingsResponseTranslator

### Client
- `trustgraph-base/trustgraph/base/document_embeddings_client.py` - return chunk_ids

### Python SDK/API
- `trustgraph-base/trustgraph/api/flow.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/socket_client.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/async_flow.py` - if applicable
- `trustgraph-base/trustgraph/api/bulk_client.py` - import/export of document embeddings
- `trustgraph-base/trustgraph/api/async_bulk_client.py` - import/export of document embeddings

### Embeddings Service
- `trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py` - pass chunk_id

### Storage Writers
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`

### Query Services
- `trustgraph-flow/trustgraph/query/doc_embeddings/qdrant/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/milvus/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/pinecone/service.py`

### Gateway
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_query.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_export.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_import.py`

### Document RAG
- `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` - add librarian client
- `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py` - fetch from Garage

### CLI
- `trustgraph-cli/trustgraph/cli/invoke_document_embeddings.py`
- `trustgraph-cli/trustgraph/cli/save_doc_embeds.py`
- `trustgraph-cli/trustgraph/cli/load_doc_embeds.py`

## Benefits

1. Single source of truth - chunk text is stored only in Garage
2. Reduced vector store size
3. Query-time provenance via chunk_id