2026-04-14 07:07:58 -04:00
---
layout: default
title: "Document Embeddings Chunk ID"
parent: "Tech Specs"
---
2026-03-07 23:10:45 +00:00
# Document Embeddings Chunk ID
## Overview
The document embeddings store currently writes chunk text directly into the vector store payload, duplicating data that already exists in Garage. This spec replaces the stored chunk text with a `chunk_id` reference.
## Current State
```python
@dataclass
class ChunkEmbeddings:
    chunk: bytes = b""
    vectors: list[list[float]] = field(default_factory=list)

@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunks: list[str] = field(default_factory=list)
```
Vector store payload:
```python
payload={"doc": chunk} # Duplicates Garage content
```
## Design
### Schema Changes
**ChunkEmbeddings** - replace chunk with chunk_id:
```python
@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)
```
**DocumentEmbeddingsResponse** - return chunk_ids instead of chunks:
```python
@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunk_ids: list[str] = field(default_factory=list)
```
### Vector Store Payload
All stores (Qdrant, Milvus, Pinecone):
```python
payload={"chunk_id": chunk_id}
```
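As a rough illustration of what a writer would hand to any of these stores, the sketch below turns one `ChunkEmbeddings` message into per-vector records whose payload carries only the `chunk_id`. The `build_records` helper, the record layout, and the random UUID point IDs are assumptions made for illustration, not the actual writer code; each store's client library has its own point or record type.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)


def build_records(emb: ChunkEmbeddings) -> list[dict]:
    """Hypothetical helper: one record per vector, payload holds only chunk_id."""
    records = []
    for vec in emb.vectors:
        # Skip empty vectors; there is nothing to index for them.
        if not vec:
            continue
        records.append({
            "id": str(uuid.uuid4()),  # assumed: a random point ID
            "vector": vec,
            "payload": {"chunk_id": emb.chunk_id},  # no chunk text stored
        })
    return records
```

Because the payload no longer depends on the chunk text, the same record shape applies regardless of how large the chunk is.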
### Document RAG Changes
The document RAG processor resolves each returned `chunk_id` to its content by fetching it from Garage through the librarian client:
```python
# Get chunk_ids from embeddings store
chunk_ids = await self.rag.doc_embeddings_client.query(...)

# Fetch chunk content from Garage
docs = []
for chunk_id in chunk_ids:
    content = await self.rag.librarian_client.get_document_content(
        chunk_id, self.user
    )
    docs.append(content)
```
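The loop above issues one librarian request at a time. If latency matters, the fetches could be issued concurrently. The following is a minimal sketch, assuming `get_document_content(chunk_id, user)` is an ordinary coroutine as used above. `FakeLibrarian` and `fetch_chunks` are hypothetical stand-ins so the example runs on its own; they are not part of TrustGraph.

```python
import asyncio


class FakeLibrarian:
    """Hypothetical stand-in for the librarian client, backed by a dict."""

    def __init__(self, store: dict[str, str]):
        self.store = store

    async def get_document_content(self, chunk_id: str, user: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, simulating I/O
        return self.store[chunk_id]


async def fetch_chunks(client, chunk_ids: list[str], user: str) -> list[tuple[str, str]]:
    """Fetch all chunks concurrently, keeping each chunk_id next to its text."""
    contents = await asyncio.gather(
        *(client.get_document_content(cid, user) for cid in chunk_ids)
    )
    # gather() returns results in the order the awaitables were passed in.
    return list(zip(chunk_ids, contents))
```

Calling `asyncio.run(fetch_chunks(FakeLibrarian({...}), chunk_ids, user))` returns `(chunk_id, content)` pairs in query order. Keeping the ID next to the text is one way to carry provenance through to the prompt or the response.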
### API/SDK Changes
**DocumentEmbeddingsClient** returns chunk_ids:
```python
return resp.chunk_ids # Changed from resp.chunks
```
**Wire format** (DocumentEmbeddingsResponseTranslator):
```python
result["chunk_ids"] = obj.chunk_ids # Changed from chunks
```
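For illustration, the wire-format change amounts to the round trip sketched below. This is a sketch only: `encode_response` and `decode_response` are hypothetical helpers rather than the translator's real methods, and the `error` field is omitted to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentEmbeddingsResponse:
    # The error field from the spec is left out of this sketch.
    chunk_ids: list[str] = field(default_factory=list)


def encode_response(obj: DocumentEmbeddingsResponse) -> dict:
    """Hypothetical helper: schema object -> wire dict."""
    return {"chunk_ids": list(obj.chunk_ids)}


def decode_response(data: dict) -> DocumentEmbeddingsResponse:
    """Hypothetical helper: wire dict -> schema object.

    A missing "chunk_ids" key yields an empty list.
    """
    return DocumentEmbeddingsResponse(chunk_ids=list(data.get("chunk_ids", [])))
```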
### CLI Changes
The CLI tools display `chunk_ids` instead of chunk text; callers can fetch the content separately if needed.
## Files to Modify
### Schema
- `trustgraph-base/trustgraph/schema/knowledge/embeddings.py` - ChunkEmbeddings
- `trustgraph-base/trustgraph/schema/services/query.py` - DocumentEmbeddingsResponse
### Messaging/Translators
- `trustgraph-base/trustgraph/messaging/translators/embeddings_query.py` - DocumentEmbeddingsResponseTranslator
### Client
- `trustgraph-base/trustgraph/base/document_embeddings_client.py` - return chunk_ids
### Python SDK/API
- `trustgraph-base/trustgraph/api/flow.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/socket_client.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/async_flow.py` - if applicable
- `trustgraph-base/trustgraph/api/bulk_client.py` - import/export document embeddings
- `trustgraph-base/trustgraph/api/async_bulk_client.py` - import/export document embeddings
### Embeddings Service
- `trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py` - pass chunk_id
### Storage Writers
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`
### Query Services
- `trustgraph-flow/trustgraph/query/doc_embeddings/qdrant/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/milvus/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/pinecone/service.py`
### Gateway
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_query.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_export.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_import.py`
### Document RAG
- `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` - add librarian client
- `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py` - fetch from Garage
### CLI
- `trustgraph-cli/trustgraph/cli/invoke_document_embeddings.py`
- `trustgraph-cli/trustgraph/cli/save_doc_embeds.py`
- `trustgraph-cli/trustgraph/cli/load_doc_embeds.py`
## Benefits
1. Single source of truth - chunk text only in Garage
2. Reduced vector store storage
3. Enables query-time provenance via chunk_id