---
layout: default
title: "Document Embeddings Chunk ID"
parent: "Tech Specs"
---

# Document Embeddings Chunk ID

## Overview

The document embeddings store currently writes chunk text directly into the vector store payload, duplicating data that already exists in Garage. This spec replaces the stored chunk text with `chunk_id` references.

## Current State

```python
@dataclass
class ChunkEmbeddings:
    chunk: bytes = b""
    vectors: list[list[float]] = field(default_factory=list)

@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunks: list[str] = field(default_factory=list)
```

Vector store payload:
```python
payload={"doc": chunk}  # Duplicates Garage content
```

## Design

### Schema Changes

**ChunkEmbeddings** - replace `chunk` with `chunk_id`:
```python
@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)
```

**DocumentEmbeddingsResponse** - return `chunk_ids` instead of `chunks`:
```python
@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunk_ids: list[str] = field(default_factory=list)
```

### Vector Store Payload

All stores (Qdrant, Milvus, Pinecone):
```python
payload={"chunk_id": chunk_id}
```
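
As a store-agnostic sketch of what a writer might emit, the hypothetical helper below turns one chunk's embeddings into one storage record per vector, each carrying only the `chunk_id` in its payload. The helper name and the record layout (`id`, `vector`, `payload`) are assumptions for illustration, not the actual Qdrant, Milvus, or Pinecone writer code.

```python
import uuid


def build_records(chunk_id: str, vectors: list[list[float]]) -> list[dict]:
    """Build one record per vector; the payload holds only the chunk reference."""
    return [
        {
            "id": str(uuid.uuid4()),            # assumed: random record ID
            "vector": vector,
            "payload": {"chunk_id": chunk_id},  # no chunk text is stored
        }
        for vector in vectors
    ]


# Illustrative chunk ID and vectors
records = build_records("doc-1/chunk-0", [[0.1, 0.2], [0.3, 0.4]])
```

Every record produced for the same chunk points back to the same `chunk_id`, so the chunk text itself never needs to be copied into the store.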

### Document RAG Changes

The document RAG processor fetches chunk content from Garage:

```python
# Get chunk_ids from embeddings store
chunk_ids = await self.rag.doc_embeddings_client.query(...)

# Fetch chunk content from Garage
docs = []
for chunk_id in chunk_ids:
    content = await self.rag.librarian_client.get_document_content(
        chunk_id, self.user
    )
    docs.append(content)
```
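
The loop above awaits each fetch in turn. If the librarian client tolerates concurrent requests (an assumption this spec does not confirm), the fetches could instead be issued together with `asyncio.gather`, which returns results in input order. `StubLibrarianClient` and `fetch_chunks` below are hypothetical names; the stub is an in-memory stand-in so the sketch runs on its own, and only the `get_document_content(chunk_id, user)` call shape is taken from the snippet above.

```python
import asyncio


class StubLibrarianClient:
    """Hypothetical in-memory stand-in for the librarian client."""

    def __init__(self, store: dict[str, str]):
        self.store = store

    async def get_document_content(self, chunk_id: str, user: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return self.store[chunk_id]


async def fetch_chunks(client, chunk_ids: list[str], user: str) -> list[str]:
    # gather() returns results in the same order as chunk_ids
    return await asyncio.gather(
        *(client.get_document_content(cid, user) for cid in chunk_ids)
    )


client = StubLibrarianClient({"c1": "first chunk", "c2": "second chunk"})
docs = asyncio.run(fetch_chunks(client, ["c2", "c1"], "alice"))
```

Because the result order matches the order of `chunk_ids`, the ranking returned by the embeddings query is preserved in `docs`.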

### API/SDK Changes

**DocumentEmbeddingsClient** returns `chunk_ids`:
```python
return resp.chunk_ids  # Changed from resp.chunks
```

**Wire format** (DocumentEmbeddingsResponseTranslator):
```python
result["chunk_ids"] = obj.chunk_ids  # Changed from chunks
```

### CLI Changes

The CLI tool displays `chunk_ids`; callers can fetch the chunk content separately if needed.

## Files to Modify

### Schema
- `trustgraph-base/trustgraph/schema/knowledge/embeddings.py` - ChunkEmbeddings
- `trustgraph-base/trustgraph/schema/services/query.py` - DocumentEmbeddingsResponse

### Messaging/Translators
- `trustgraph-base/trustgraph/messaging/translators/embeddings_query.py` - DocumentEmbeddingsResponseTranslator

### Client
- `trustgraph-base/trustgraph/base/document_embeddings_client.py` - return chunk_ids

### Python SDK/API
- `trustgraph-base/trustgraph/api/flow.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/socket_client.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/async_flow.py` - if applicable
- `trustgraph-base/trustgraph/api/bulk_client.py` - import/export document embeddings
- `trustgraph-base/trustgraph/api/async_bulk_client.py` - import/export document embeddings

### Embeddings Service
- `trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py` - pass chunk_id

### Storage Writers
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`

### Query Services
- `trustgraph-flow/trustgraph/query/doc_embeddings/qdrant/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/milvus/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/pinecone/service.py`

### Gateway
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_query.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_export.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_import.py`

### Document RAG
- `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` - add librarian client
- `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py` - fetch from Garage

### CLI
- `trustgraph-cli/trustgraph/cli/invoke_document_embeddings.py`
- `trustgraph-cli/trustgraph/cli/save_doc_embeds.py`
- `trustgraph-cli/trustgraph/cli/load_doc_embeds.py`

## Benefits

1. Single source of truth - chunk text only in Garage
2. Reduced vector store storage
3. Enables query-time provenance via chunk_id