mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 16:36:21 +02:00
- Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes
- Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str]
  instead of chunks
- Translators - Updated to serialize/deserialize chunk_id
- Clients - DocumentEmbeddingsClient.query() returns chunk_ids
- SDK/API - flow.py, socket_client.py, bulk_client.py updated
- Document embeddings service - Stores chunk_id (document ID) instead
  of chunk text
- Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload
- Query services - Return chunk_id from vector store searches
- Gateway dispatchers - Serialize chunk_id in API responses
- Document RAG - Added librarian client to fetch chunk content from
  Garage using chunk_ids
- CLI tools - Updated all three tools:
  - invoke_document_embeddings.py - displays chunk_ids, removed
    max_chunk_length
  - save_doc_embeds.py - exports chunk_id
  - load_doc_embeds.py - imports chunk_id
Document Embeddings Chunk ID
Overview
The document embeddings service currently writes chunk text directly into the vector store payload, duplicating data that already exists in Garage. This spec replaces the stored chunk text with chunk_id references, so that chunk content is fetched from Garage when it is needed.
Current State
@dataclass
class ChunkEmbeddings:
    chunk: bytes = b""
    vectors: list[list[float]] = field(default_factory=list)

@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunks: list[str] = field(default_factory=list)
Vector store payload:
payload={"doc": chunk} # Duplicates Garage content
Design
Schema Changes
ChunkEmbeddings - replace chunk with chunk_id:
@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)
DocumentEmbeddingsResponse - return chunk_ids instead of chunks:
@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunk_ids: list[str] = field(default_factory=list)
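As a rough illustration of the new shapes, the sketch below redefines both dataclasses locally so that it runs on its own. The `Error` class is a stand-in with assumed fields, `Optional[Error]` is used as an equivalent spelling of `Error | None`, and the chunk_id value is a made-up example rather than the real ID format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Error:
    # Stand-in for the schema's Error type; these fields are assumptions.
    type: str = ""
    message: str = ""


@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list = field(default_factory=list)


@dataclass
class DocumentEmbeddingsResponse:
    # Optional[Error] is equivalent to the spec's `Error | None`.
    error: Optional[Error] = None
    chunk_ids: list = field(default_factory=list)


# The embedding message carries a reference to the chunk, not its text.
emb = ChunkEmbeddings(chunk_id="doc-123/chunk-0", vectors=[[0.1, 0.2, 0.3]])

# A query response lists the IDs of the matching chunks.
resp = DocumentEmbeddingsResponse(chunk_ids=[emb.chunk_id])
```

Code that previously read `resp.chunks` as text now receives identifiers and has to resolve them separately.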
Vector Store Payload
All stores (Qdrant, Milvus, Pinecone):
payload={"chunk_id": chunk_id}
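To make the payload change concrete, here is a minimal sketch that uses an in-memory list as a stand-in for a vector store. `InMemoryStore`, `upsert`, and `search` are hypothetical names for illustration only and do not reflect the Qdrant, Milvus, or Pinecone client APIs; similarity is a plain dot product.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a vector store; not a real client API."""

    points: list = field(default_factory=list)

    def upsert(self, vector, chunk_id):
        # Only the reference is stored in the payload, never the chunk text.
        self.points.append({"vector": vector, "payload": {"chunk_id": chunk_id}})

    def search(self, query, limit=2):
        # Rank stored points by dot product with the query vector.
        def score(point):
            return sum(a * b for a, b in zip(point["vector"], query))

        ranked = sorted(self.points, key=score, reverse=True)
        return [p["payload"]["chunk_id"] for p in ranked[:limit]]


store = InMemoryStore()
store.upsert([1.0, 0.0], "chunk-a")
store.upsert([0.0, 1.0], "chunk-b")
store.upsert([0.7, 0.7], "chunk-c")

# The search result is a list of chunk_ids, ordered best match first.
nearest = store.search([1.0, 0.1], limit=2)
```

Because the payload holds only an identifier, its size no longer grows with the length of the chunk.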
Document RAG Changes
The document RAG processor fetches chunk content from Garage:
# Get chunk_ids from embeddings store
chunk_ids = await self.rag.doc_embeddings_client.query(...)

# Fetch chunk content from Garage
docs = []
for chunk_id in chunk_ids:
    content = await self.rag.librarian_client.get_document_content(
        chunk_id, self.user
    )
    docs.append(content)
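The loop above fetches chunks one at a time. If the librarian client tolerates concurrent requests (an assumption this spec does not confirm), the fetches could be issued together. A minimal sketch follows, with `StubLibrarian` and `fetch_chunks` as hypothetical names; only the `get_document_content(chunk_id, user)` call shape is taken from the snippet above.

```python
import asyncio


class StubLibrarian:
    """Hypothetical in-memory stand-in for the librarian client."""

    def __init__(self, contents):
        self.contents = contents

    async def get_document_content(self, chunk_id, user):
        # A real client would make a network request here.
        await asyncio.sleep(0)
        return self.contents[chunk_id]


async def fetch_chunks(librarian, chunk_ids, user):
    # gather() returns results in the order of its arguments, so the
    # ranking produced by the embeddings query is preserved.
    return await asyncio.gather(
        *(librarian.get_document_content(cid, user) for cid in chunk_ids)
    )


librarian = StubLibrarian({"chunk-a": "first chunk", "chunk-b": "second chunk"})
docs = asyncio.run(fetch_chunks(librarian, ["chunk-b", "chunk-a"], "alice"))
```

Whether this is worthwhile depends on how many chunks a query returns and on the latency of each Garage fetch; the sequential loop is simpler and may be sufficient.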
API/SDK Changes
DocumentEmbeddingsClient returns chunk_ids:
return resp.chunk_ids # Changed from resp.chunks
Wire format (DocumentEmbeddingsResponseTranslator):
result["chunk_ids"] = obj.chunk_ids # Changed from chunks
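The wire-format change can be pictured as a small round trip. This is a sketch under assumptions: `response_to_wire` and `response_from_wire` are hypothetical helper names rather than the translator's actual methods, and the response class is a trimmed local copy without the error field.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DocumentEmbeddingsResponse:
    # Trimmed local copy of the schema class; the error field is omitted.
    chunk_ids: list = field(default_factory=list)


def response_to_wire(obj):
    # Hypothetical helper mirroring the translator's output step.
    result = {}
    result["chunk_ids"] = obj.chunk_ids
    return result


def response_from_wire(data):
    # Hypothetical inverse; a missing key decodes to an empty list.
    return DocumentEmbeddingsResponse(chunk_ids=list(data.get("chunk_ids", [])))


original = DocumentEmbeddingsResponse(chunk_ids=["chunk-a", "chunk-b"])
wire = json.dumps(response_to_wire(original))
decoded = response_from_wire(json.loads(wire))
```

Consumers that still look for a `chunks` key would find nothing under this format, so clients and gateway need to be updated together.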
CLI Changes
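invoke_document_embeddings displays chunk_ids instead of chunk text, and its max_chunk_length option is removed; callers can fetch content separately if needed. save_doc_embeds and load_doc_embeds export and import chunk_id.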
Files to Modify
Schema
- trustgraph-base/trustgraph/schema/knowledge/embeddings.py - ChunkEmbeddings
- trustgraph-base/trustgraph/schema/services/query.py - DocumentEmbeddingsResponse
Messaging/Translators
- trustgraph-base/trustgraph/messaging/translators/embeddings_query.py - DocumentEmbeddingsResponseTranslator
Client
- trustgraph-base/trustgraph/base/document_embeddings_client.py - return chunk_ids
Python SDK/API
- trustgraph-base/trustgraph/api/flow.py - document_embeddings_query
- trustgraph-base/trustgraph/api/socket_client.py - document_embeddings_query
- trustgraph-base/trustgraph/api/async_flow.py - if applicable
- trustgraph-base/trustgraph/api/bulk_client.py - import/export document embeddings
- trustgraph-base/trustgraph/api/async_bulk_client.py - import/export document embeddings
Embeddings Service
- trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py - pass chunk_id
Storage Writers
- trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py
- trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py
- trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py
Query Services
- trustgraph-flow/trustgraph/query/doc_embeddings/qdrant/service.py
- trustgraph-flow/trustgraph/query/doc_embeddings/milvus/service.py
- trustgraph-flow/trustgraph/query/doc_embeddings/pinecone/service.py
Gateway
- trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_query.py
- trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_export.py
- trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_import.py
Document RAG
- trustgraph-flow/trustgraph/retrieval/document_rag/rag.py - add librarian client
- trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py - fetch from Garage
CLI
- trustgraph-cli/trustgraph/cli/invoke_document_embeddings.py
- trustgraph-cli/trustgraph/cli/save_doc_embeds.py
- trustgraph-cli/trustgraph/cli/load_doc_embeds.py
Benefits
- Single source of truth - chunk text only in Garage
- Reduced vector store storage
- Enables query-time provenance via chunk_id