| layout | title | parent |
|---|---|---|
| default | Document Embeddings Chunk ID | Spanish (Beta) |
Document Embeddings Chunk ID
Beta Translation: This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.
Overview
Currently, the document embeddings store writes the chunk text directly into the vector database payload, duplicating data that already exists in Garage. This specification replaces the stored chunk text with chunk_id references.
Current State
@dataclass
class ChunkEmbeddings:
    chunk: bytes = b""
    vectors: list[list[float]] = field(default_factory=list)

@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunks: list[str] = field(default_factory=list)
Vector store payload:
payload={"doc": chunk} # Duplicates Garage content
Design
Schema Changes
ChunkEmbeddings - replace chunk with chunk_id:
@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)
DocumentEmbeddingsResponse - return chunk_ids instead of chunks:
@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunk_ids: list[str] = field(default_factory=list)
Vector Store Payload
All stores (Qdrant, Milvus, Pinecone):
payload={"chunk_id": chunk_id}
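To make the effect of the payload change concrete, the sketch below compares the serialized size of the old and new payload shapes. The helper names `old_payload` and `new_payload`, the sample chunk text, and the identifier format are all assumptions for illustration; no vector store client is involved.

```python
import json


def old_payload(chunk_text: str) -> dict:
    # Previous layout: the full chunk text is stored alongside the vector.
    return {"doc": chunk_text}


def new_payload(chunk_id: str) -> dict:
    # New layout: only a reference to the chunk held in Garage.
    return {"chunk_id": chunk_id}


chunk_text = "lorem ipsum " * 200  # made-up chunk text
chunk_id = "doc-1/chunk-0"         # made-up identifier format

old_size = len(json.dumps(old_payload(chunk_text)))
new_size = len(json.dumps(new_payload(chunk_id)))
print(old_size, new_size)
```

The new payload size depends only on the identifier length, not on the chunk length.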
Document RAG Changes
The Document RAG processor fetches chunk content from Garage:
# Get chunk_ids from embeddings store
chunk_ids = await self.rag.doc_embeddings_client.query(...)

# Fetch chunk content from Garage
docs = []
for chunk_id in chunk_ids:
    content = await self.rag.librarian_client.get_document_content(
        chunk_id, self.user
    )
    docs.append(content)
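The loop above awaits each fetch in turn. If per-chunk latency turns out to matter, one possible variation is to issue the fetches concurrently with `asyncio.gather`. This is a sketch under assumptions, not part of the specification: `StubLibrarianClient` and `fetch_chunks` are hypothetical names, and the `bytes` return type of `get_document_content` is assumed rather than confirmed.

```python
import asyncio


class StubLibrarianClient:
    """Hypothetical in-memory stand-in for the librarian client."""

    def __init__(self, store: dict[str, bytes]):
        self._store = store

    async def get_document_content(self, chunk_id: str, user: str) -> bytes:
        await asyncio.sleep(0)  # yield to the event loop once before returning
        return self._store[chunk_id]


async def fetch_chunks(client, chunk_ids: list[str], user: str) -> list[bytes]:
    # gather() returns results in argument order, so docs[i] matches chunk_ids[i].
    return await asyncio.gather(
        *(client.get_document_content(cid, user) for cid in chunk_ids)
    )


store = {"c1": b"first chunk", "c2": b"second chunk"}
docs = asyncio.run(fetch_chunks(StubLibrarianClient(store), ["c2", "c1"], "alice"))
```

Because results come back in request order, the ranking returned by the embeddings query is preserved. A real deployment may want to cap concurrency, which this sketch does not do.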
API/SDK Changes
DocumentEmbeddingsClient returns chunk_ids:
return resp.chunk_ids # Changed from resp.chunks
Wire format (DocumentEmbeddingsResponseTranslator):
result["chunk_ids"] = obj.chunk_ids # Changed from chunks
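The wire-format change amounts to renaming one key. The round trip below sketches this with plain functions. `response_to_wire` and `response_from_wire` are hypothetical names standing in for the translator's methods, whose real signatures are not shown in this document, and the dataclass is a trimmed copy of the new schema.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentEmbeddingsResponse:
    # Trimmed copy of the new schema; the error field is omitted for brevity.
    chunk_ids: list[str] = field(default_factory=list)


def response_to_wire(obj: DocumentEmbeddingsResponse) -> dict:
    # Hypothetical free function standing in for the translator method.
    result = {}
    result["chunk_ids"] = obj.chunk_ids  # previously result["chunks"]
    return result


def response_from_wire(data: dict) -> DocumentEmbeddingsResponse:
    # Hypothetical inverse; a missing key falls back to an empty list.
    return DocumentEmbeddingsResponse(chunk_ids=list(data.get("chunk_ids", [])))


wire = response_to_wire(DocumentEmbeddingsResponse(chunk_ids=["c1", "c2"]))
```

Clients that still read the old `chunks` key will find it absent, so producers and consumers need to move to the new key together.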
CLI Changes
The CLI tool displays chunk_ids (users can fetch the content separately if needed).
Files to Modify
Schema
trustgraph-base/trustgraph/schema/knowledge/embeddings.py - ChunkEmbeddings
trustgraph-base/trustgraph/schema/services/query.py - DocumentEmbeddingsResponse
Messaging/Translators
trustgraph-base/trustgraph/messaging/translators/embeddings_query.py - DocumentEmbeddingsResponseTranslator
Client
trustgraph-base/trustgraph/base/document_embeddings_client.py - return chunk_ids
Python SDK/API
trustgraph-base/trustgraph/api/flow.py - document_embeddings_query
trustgraph-base/trustgraph/api/socket_client.py - document_embeddings_query
trustgraph-base/trustgraph/api/async_flow.py - if applicable
trustgraph-base/trustgraph/api/bulk_client.py - import/export document embeddings
trustgraph-base/trustgraph/api/async_bulk_client.py - import/export document embeddings
Embeddings Service
trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py - pass chunk_id
Storage Writers
trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py
trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py
trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py
Query Services
trustgraph-flow/trustgraph/query/doc_embeddings/qdrant/service.py
trustgraph-flow/trustgraph/query/doc_embeddings/milvus/service.py
trustgraph-flow/trustgraph/query/doc_embeddings/pinecone/service.py
Gateway
trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_query.py
trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_export.py
trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_import.py
Document RAG
trustgraph-flow/trustgraph/retrieval/document_rag/rag.py - add librarian client
trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py - fetch from Garage
CLI
trustgraph-cli/trustgraph/cli/invoke_document_embeddings.py
trustgraph-cli/trustgraph/cli/save_doc_embeds.py
trustgraph-cli/trustgraph/cli/load_doc_embeds.py
Benefits
- Single source of truth: chunk text lives only in Garage.
- Reduced vector store storage.
- Enables query-time provenance tracing via chunk_id.