mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-29 02:23:44 +02:00
Document chunks not stored in vector store (#665)
- Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes
- Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str]
  instead of chunks
- Translators - Updated to serialize/deserialize chunk_id
- Clients - DocumentEmbeddingsClient.query() returns chunk_ids
- SDK/API - flow.py, socket_client.py, bulk_client.py updated
- Document embeddings service - Stores chunk_id (document ID) instead
  of chunk text
- Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload
- Query services - Return chunk_id from vector store searches
- Gateway dispatchers - Serialize chunk_id in API responses
- Document RAG - Added librarian client to fetch chunk content from
  Garage using chunk_ids
- CLI tools - Updated all three tools:
  - invoke_document_embeddings.py - displays chunk_ids, removed
    max_chunk_length
  - save_doc_embeds.py - exports chunk_id
  - load_doc_embeds.py - imports chunk_id
parent be358efe67
commit 24bbe94136
24 changed files with 331 additions and 91 deletions
136
docs/tech-specs/document-embeddings-chunk-id.md
Normal file
@@ -0,0 +1,136 @@
# Document Embeddings Chunk ID

## Overview

Document embeddings storage currently writes chunk text directly into the vector store payload, duplicating data that already exists in Garage. This spec replaces the stored chunk text with `chunk_id` references.

## Current State

```python
@dataclass
class ChunkEmbeddings:
    chunk: bytes = b""
    vectors: list[list[float]] = field(default_factory=list)

@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunks: list[str] = field(default_factory=list)
```

Vector store payload:
```python
payload={"doc": chunk}  # Duplicates Garage content
```

## Design

### Schema Changes

**ChunkEmbeddings** - replace `chunk` with `chunk_id`:
```python
@dataclass
class ChunkEmbeddings:
    chunk_id: str = ""
    vectors: list[list[float]] = field(default_factory=list)
```

**DocumentEmbeddingsResponse** - return `chunk_ids` instead of `chunks`:
```python
@dataclass
class DocumentEmbeddingsResponse:
    error: Error | None = None
    chunk_ids: list[str] = field(default_factory=list)
```

### Vector Store Payload

All stores (Qdrant, Milvus, Pinecone):
```python
payload={"chunk_id": chunk_id}
```

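A store-agnostic sketch of what a writer might assemble before handing records to its backend is shown below. The helper name `build_points` and the `id`/`vector`/`payload` record layout are assumptions for illustration only; each backend has its own client types, and none of them is called here.

```python
import uuid


def build_points(chunk_id: str, vectors: list[list[float]]) -> list[dict]:
    """Build one record per vector, each referencing the same chunk_id.

    Hypothetical helper: the record layout is illustrative, not the
    format of any particular vector store client.
    """
    return [
        {
            "id": str(uuid.uuid4()),            # random record ID
            "vector": vec,
            "payload": {"chunk_id": chunk_id},  # reference only, no chunk text
        }
        for vec in vectors
    ]


points = build_points("doc-123/chunk-0", [[0.1, 0.2], [0.3, 0.4]])
```

Every record for a chunk carries the same small payload, so the payload size no longer depends on how long the chunk text is.
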
### Document RAG Changes

The document RAG processor fetches chunk content from Garage:

```python
# Get chunk_ids from embeddings store
chunk_ids = await self.rag.doc_embeddings_client.query(...)

# Fetch chunk content from Garage
docs = []
for chunk_id in chunk_ids:
    content = await self.rag.librarian_client.get_document_content(
        chunk_id, self.user
    )
    docs.append(content)
```

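The loop above awaits each fetch in turn. If retrieval latency becomes a concern, one possible variation is to issue the fetches concurrently. The sketch below assumes `get_document_content(chunk_id, user)` is a coroutine, as in the snippet above, and uses a stub `FakeLibrarian` in place of the real librarian client; `fetch_chunks` is a hypothetical helper, and whether the real client tolerates concurrent requests is not established here.

```python
import asyncio


class FakeLibrarian:
    """Stand-in for the librarian client; the real one reads from Garage."""

    def __init__(self, store: dict[str, str]):
        self.store = store

    async def get_document_content(self, chunk_id: str, user: str) -> str:
        await asyncio.sleep(0)  # yield control, as a network call would
        return self.store[chunk_id]


async def fetch_chunks(librarian, chunk_ids: list[str], user: str) -> list[str]:
    # gather() returns results in input order, so docs stay aligned
    # with chunk_ids even if the fetches complete out of order.
    results = await asyncio.gather(
        *(librarian.get_document_content(cid, user) for cid in chunk_ids)
    )
    return list(results)


librarian = FakeLibrarian({"c1": "first chunk", "c2": "second chunk"})
docs = asyncio.run(fetch_chunks(librarian, ["c2", "c1"], "alice"))
```
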
### API/SDK Changes

**DocumentEmbeddingsClient** returns `chunk_ids`:
```python
return resp.chunk_ids  # Changed from resp.chunks
```

**Wire format** (DocumentEmbeddingsResponseTranslator):
```python
result["chunk_ids"] = obj.chunk_ids  # Changed from chunks
```

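To make the wire-format change concrete, here is a simplified round trip. The function names `response_to_wire` and `wire_to_response` are hypothetical and do not mirror the real translator class; the `error` field is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentEmbeddingsResponse:
    # Trimmed copy of the schema class; the error field is left out here
    chunk_ids: list[str] = field(default_factory=list)


def response_to_wire(obj: DocumentEmbeddingsResponse) -> dict:
    """Serialize a response to a JSON-compatible dict (hypothetical helper)."""
    result = {}
    if obj.chunk_ids is not None:
        result["chunk_ids"] = list(obj.chunk_ids)
    return result


def wire_to_response(data: dict) -> DocumentEmbeddingsResponse:
    """Rebuild a response from its wire form (hypothetical helper)."""
    return DocumentEmbeddingsResponse(chunk_ids=list(data.get("chunk_ids", [])))


wire = response_to_wire(DocumentEmbeddingsResponse(chunk_ids=["c1", "c2"]))
restored = wire_to_response(wire)
```

Clients that still read the old `chunks` key from the wire dict will find nothing there, which is why the SDK and CLI callers are updated in the same change.
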
### CLI Changes

The `invoke_document_embeddings` CLI tool displays `chunk_ids`; callers can fetch the chunk content separately if needed.

## Files to Modify

### Schema
- `trustgraph-base/trustgraph/schema/knowledge/embeddings.py` - ChunkEmbeddings
- `trustgraph-base/trustgraph/schema/services/query.py` - DocumentEmbeddingsResponse

### Messaging/Translators
- `trustgraph-base/trustgraph/messaging/translators/embeddings_query.py` - DocumentEmbeddingsResponseTranslator

### Client
- `trustgraph-base/trustgraph/base/document_embeddings_client.py` - return chunk_ids

### Python SDK/API
- `trustgraph-base/trustgraph/api/flow.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/socket_client.py` - document_embeddings_query
- `trustgraph-base/trustgraph/api/async_flow.py` - if applicable
- `trustgraph-base/trustgraph/api/bulk_client.py` - import/export document embeddings
- `trustgraph-base/trustgraph/api/async_bulk_client.py` - import/export document embeddings

### Embeddings Service
- `trustgraph-flow/trustgraph/embeddings/document_embeddings/embeddings.py` - pass chunk_id

### Storage Writers
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`

### Query Services
- `trustgraph-flow/trustgraph/query/doc_embeddings/qdrant/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/milvus/service.py`
- `trustgraph-flow/trustgraph/query/doc_embeddings/pinecone/service.py`

### Gateway
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_query.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_export.py`
- `trustgraph-flow/trustgraph/gateway/dispatch/document_embeddings_import.py`

### Document RAG
- `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` - add librarian client
- `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py` - fetch from Garage

### CLI
- `trustgraph-cli/trustgraph/cli/invoke_document_embeddings.py`
- `trustgraph-cli/trustgraph/cli/save_doc_embeds.py`
- `trustgraph-cli/trustgraph/cli/load_doc_embeds.py`

## Benefits

1. Single source of truth - chunk text only in Garage
2. Reduced vector store storage
3. Enables query-time provenance via chunk_id