---
layout: default
title: "RAG Streaming Support Technical Specification"
parent: "Tech Specs"
---
# RAG Streaming Support Technical Specification
## Overview
This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.
## Goals
- **Consistent streaming UX**: Provide the same streaming experience across all TrustGraph services
- **Minimal API changes**: Add streaming support with a single `streaming` flag, following established patterns
- **Backward compatibility**: Maintain existing non-streaming behavior as default
- **Reuse existing infrastructure**: Leverage the PromptClient streaming that is already implemented
- **Gateway support**: Enable streaming through the websocket gateway for client applications
## Background
Currently implemented streaming services:
- **LLM text-completion service**: Phase 1 - streaming from LLM providers
- **Prompt service**: Phase 2 - streaming through prompt templates
- **Agent service**: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks
Current limitations for RAG services:
- GraphRAG and DocumentRAG support only blocking responses
- Users must wait for the complete LLM response before seeing any output
- Long responses to knowledge graph or document queries appear all at once, which makes for poor UX
- The experience is inconsistent with other TrustGraph services, which already stream
This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:
- Provide consistent streaming UX across all query types
- Reduce perceived latency for RAG queries
- Enable better progress feedback for long-running queries
- Support real-time display in client applications
## Technical Design
### Architecture
The RAG streaming implementation leverages existing infrastructure:
1. **PromptClient Streaming** (Already implemented)
   - `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
   - These call `prompt()` internally with streaming support
   - No changes needed to PromptClient
   - Module: `trustgraph-base/trustgraph/base/prompt_client.py`
2. **GraphRAG Service** (Needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
   - GraphRagRequest schema needs `streaming` field
   - Modules:
     - `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
     - `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
     - `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
     - `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)
3. **DocumentRAG Service** (Needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.document_prompt()`
   - DocumentRagRequest schema needs `streaming` field
   - Modules:
     - `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
     - `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
     - `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
     - `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)
### Data Flow
**Non-streaming (current)**:
```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                      ↓
                                 Prompt Service → LLM
                                      ↓
                                 Complete response
                                      ↓
Client ← Gateway ← RAG Service ← Response
```
**Streaming (proposed)**:
```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                      ↓
                                 Prompt Service → LLM (streaming)
                                      ↓
                                 Chunk → callback → RAG Response (chunk)
                                      ↓                       ↓
Client ← Gateway ← ────────────────────────────────── Response stream
```
### APIs
**GraphRAG Changes**:
1. **GraphRag.query()** - Add streaming parameters
```python
async def query(
    self, query, user, collection,
    verbose=False, streaming=False, chunk_callback=None  # NEW
):
    # ... existing entity/triple retrieval (produces `kg`) ...

    if streaming and chunk_callback:
        # Stream LLM output: each chunk is delivered via chunk_callback
        resp = await self.prompt_client.kg_prompt(
            query, kg,
            streaming=True,
            chunk_callback=chunk_callback
        )
    else:
        # Existing blocking path
        resp = await self.prompt_client.kg_prompt(query, kg)

    return resp
```
2. **GraphRagRequest schema** - Add streaming field
```python
class GraphRagRequest(Record):
    query = String()
    user = String()
    collection = String()
    streaming = Boolean()  # NEW
```
3. **GraphRagResponse schema** - Add streaming fields (follow Agent pattern)
```python
class GraphRagResponse(Record):
    response = String()        # Legacy: complete response
    chunk = String()           # NEW: streaming chunk
    end_of_stream = Boolean()  # NEW: indicates last chunk
```
4. **Processor** - Pass streaming through
```python
async def handle(self, msg):
    # ... existing code ...

    async def send_chunk(chunk):
        # Forward each LLM chunk as an intermediate response message
        await self.respond(GraphRagResponse(
            chunk=chunk,
            end_of_stream=False,
            response=None
        ))

    if request.streaming:
        full_response = await self.rag.query(
            query=request.query,
            user=request.user,
            collection=request.collection,
            streaming=True,
            chunk_callback=send_chunk
        )
        # Send final message
        await self.respond(GraphRagResponse(
            chunk=None,
            end_of_stream=True,
            response=full_response
        ))
    else:
        # Existing non-streaming path
        response = await self.rag.query(...)
        await self.respond(GraphRagResponse(response=response))
```
**DocumentRAG Changes**:
Identical pattern to GraphRAG:
1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
2. Add `streaming` field to `DocumentRagRequest`
3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
4. Update Processor to handle streaming with callbacks
**Gateway Changes**:
Both `graph_rag.py` and `document_rag.py` in `gateway/dispatch` need updates to forward streaming chunks to the websocket:
```python
import json

async def handle(self, message, session, websocket):
    # ... existing code ...

    if request.streaming:
        async def recipient(resp):
            # Forward every chunk, and also the final message: the Processor
            # sends it with chunk=None, so checking resp.chunk alone would
            # never deliver "complete": true to the client.
            if resp.chunk or resp.end_of_stream:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk or ""},
                    "complete": resp.end_of_stream
                }))
            return resp.end_of_stream

        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
```
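To illustrate what a client might do with the forwarded messages, here is a minimal sketch that reassembles a streamed answer from the JSON envelopes shown above (`id`, `response.chunk`, `complete`). The function name `assemble_stream` and the idea of filtering by request id on a shared websocket are assumptions for illustration, not part of the gateway.

```python
import json
from typing import Iterable, Tuple


def assemble_stream(raw_messages: Iterable[str], request_id: str) -> Tuple[str, bool]:
    """Join the chunks belonging to one request.

    Returns (text, complete); `complete` is True only if a message with
    "complete": true was seen for this request id.
    """
    parts = []
    complete = False
    for raw in raw_messages:
        msg = json.loads(raw)
        # A websocket may multiplex several requests; skip unrelated ones.
        if msg.get("id") != request_id:
            continue
        chunk = (msg.get("response") or {}).get("chunk")
        if chunk:
            parts.append(chunk)
        if msg.get("complete"):
            complete = True
            break  # ignore anything after the end-of-stream marker
    return "".join(parts), complete
```

A real client would render each chunk as it arrives rather than collecting them first; the accumulation here only keeps the sketch easy to check.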
### Implementation Details
**Implementation order**:
1. Add schema fields (Request + Response for both RAG services)
2. Update GraphRag.query() and DocumentRag.query() methods
3. Update Processors to handle streaming
4. Update Gateway dispatch handlers
5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following agent CLI pattern)
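The CLI behaviour in step 5 could be sketched with `argparse` as follows. Only `-q` and `--no-streaming` appear in this specification; the long option `--question`, the help strings, and the helper names `build_parser` and `wants_streaming` are hypothetical, and the real tools may be structured differently.

```python
import argparse
from typing import List


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Build a parser with streaming on by default and an opt-out flag."""
    parser = argparse.ArgumentParser(prog=prog)
    # `--question` is an assumed long form of the `-q` option.
    parser.add_argument("-q", "--question", required=True, help="Query text")
    parser.add_argument(
        "--no-streaming",
        action="store_true",
        help="Wait for the complete response instead of streaming chunks",
    )
    return parser


def wants_streaming(argv: List[str]) -> bool:
    """Return the value to place in the request's `streaming` field."""
    args = build_parser("tg-invoke-graph-rag").parse_args(argv)
    return not args.no_streaming
```

Using a negative flag keeps the common case (streaming) free of extra arguments while leaving the request schema default untouched.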
**Callback pattern**:
Follow the same async callback pattern established in Agent streaming:
- Processor defines `async def send_chunk(chunk)` callback
- Passes callback to RAG service
- RAG service passes callback to PromptClient
- PromptClient invokes callback for each LLM chunk
- Processor sends streaming response message for each chunk
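The callback chain above can be exercised end to end without Pulsar or an LLM. The following is a minimal sketch under stated assumptions: `FakePromptClient`, `FakeRag`, `RagResponse`, and the free function `handle` are stand-ins invented for illustration, and responses are collected in a list rather than sent over a queue.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RagResponse:
    """Stand-in for GraphRagResponse / DocumentRagResponse."""
    response: Optional[str] = None
    chunk: Optional[str] = None
    end_of_stream: bool = False


class FakePromptClient:
    """Hypothetical PromptClient replacement that emits fixed chunks."""

    def __init__(self, chunks: List[str]):
        self.chunks = chunks

    async def kg_prompt(self, query, kg, streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            for chunk in self.chunks:
                await chunk_callback(chunk)  # one call per LLM chunk
        return "".join(self.chunks)


class FakeRag:
    """Stand-in for the RAG service; retrieval is omitted."""

    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def query(self, query, streaming=False, chunk_callback=None):
        kg = []  # placeholder for retrieved triples
        return await self.prompt_client.kg_prompt(
            query, kg, streaming=streaming, chunk_callback=chunk_callback
        )


async def handle(rag, query: str, streaming: bool) -> List[RagResponse]:
    """Mimic the Processor: returns the messages it would send."""
    sent: List[RagResponse] = []

    async def send_chunk(chunk):
        sent.append(RagResponse(chunk=chunk, end_of_stream=False))

    if streaming:
        full = await rag.query(query, streaming=True, chunk_callback=send_chunk)
        sent.append(RagResponse(response=full, end_of_stream=True))
    else:
        sent.append(RagResponse(response=await rag.query(query)))
    return sent
```

With chunks `["Hel", "lo"]`, the streaming path yields two chunk messages followed by one final message carrying the full text, while the non-streaming path yields a single message.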
**Error handling**:
- Errors during streaming should send error response with `end_of_stream=True`
- Follow existing error propagation patterns from Agent streaming
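One way to meet the first requirement is to wrap the streaming call so that any exception produces a terminal message. This is a sketch only: the `error` field, the `StreamMessage` type, and the string error format are assumptions, since this specification does not define how errors are represented in the response schema.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamMessage:
    """Stand-in response; `error` is a hypothetical field."""
    chunk: Optional[str] = None
    response: Optional[str] = None
    end_of_stream: bool = False
    error: Optional[str] = None


async def stream_with_error_handling(query_fn, respond):
    """Run a streaming query and always finish with end_of_stream=True.

    query_fn(chunk_callback) performs the RAG query; respond(msg) sends
    one message to the caller.
    """
    async def send_chunk(chunk):
        await respond(StreamMessage(chunk=chunk))

    try:
        full = await query_fn(send_chunk)
    except Exception as exc:
        # Terminate the stream so clients do not wait indefinitely.
        await respond(StreamMessage(
            error=f"{type(exc).__name__}: {exc}",
            end_of_stream=True,
        ))
        return
    await respond(StreamMessage(response=full, end_of_stream=True))
```

Chunks delivered before the failure are not retracted, so clients should treat a terminal message with an error as invalidating any partial output they have already shown.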
## Security Considerations
No new security considerations beyond existing RAG services:
- Streaming responses use same user/collection isolation
- No changes to authentication or authorization
- Chunk boundaries don't expose sensitive data
## Performance Considerations
**Benefits**:
- Reduced perceived latency (first tokens arrive faster)
- Better UX for long responses
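- Lower client-side buffering (clients can render chunks as they arrive; the final Processor message still carries the complete response, so the service itself continues to hold the full text)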
**Potential concerns**:
- More Pulsar messages for streaming responses
- Slightly higher CPU for chunking/callback overhead
- Mitigated by: streaming is opt-in at the API level (the `streaming` field defaults to False); only the CLI tools enable it by default
**Testing considerations**:
- Test with large knowledge graphs (many triples)
- Test with many retrieved documents
- Measure overhead of streaming vs non-streaming
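For the overhead measurement, time to first chunk and total time are the two figures most directly tied to perceived latency. The sketch below shows one way to collect them; `measure_stream` and `fake_query` are hypothetical helpers, and the simulated delay stands in for a real RAG query.

```python
import asyncio
import time


async def measure_stream(query_fn):
    """Time a streaming query; query_fn takes an async chunk callback."""
    start = time.perf_counter()
    first_chunk_s = None
    chunks = 0

    async def on_chunk(chunk):
        nonlocal first_chunk_s, chunks
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        chunks += 1

    await query_fn(on_chunk)
    total_s = time.perf_counter() - start
    return {"first_chunk_s": first_chunk_s, "total_s": total_s, "chunks": chunks}


async def fake_query(on_chunk, n_chunks=5, delay_s=0.01):
    """Simulated streaming query: emits n_chunks with a fixed delay."""
    for i in range(n_chunks):
        await asyncio.sleep(delay_s)
        await on_chunk(f"tok{i} ")
```

Running the same harness against the non-streaming path (where the callback is never invoked and `first_chunk_s` stays `None`) gives a baseline for comparing total time.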
## Testing Strategy
**Unit tests**:
- Test GraphRag.query() with streaming=True/False
- Test DocumentRag.query() with streaming=True/False
- Mock PromptClient to verify callback invocations
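A unit test along these lines could use `unittest.mock.AsyncMock` in place of PromptClient. This is a sketch, not the project's test suite: `RagUnderTest` is a simplified stand-in for `GraphRag` that mirrors the `query()` logic shown earlier, and `make_prompt_client` is a hypothetical helper.

```python
import asyncio
from unittest.mock import AsyncMock


class RagUnderTest:
    """Simplified stand-in for GraphRag; retrieval is stubbed out."""

    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def query(self, query, streaming=False, chunk_callback=None):
        kg = [("s", "p", "o")]  # placeholder for retrieved triples
        if streaming and chunk_callback:
            return await self.prompt_client.kg_prompt(
                query, kg, streaming=True, chunk_callback=chunk_callback
            )
        return await self.prompt_client.kg_prompt(query, kg)


def make_prompt_client(chunks):
    """Mock whose kg_prompt() invokes the callback once per chunk."""
    async def fake_kg_prompt(query, kg, streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            for chunk in chunks:
                await chunk_callback(chunk)
        return "".join(chunks)

    client = AsyncMock()
    client.kg_prompt.side_effect = fake_kg_prompt
    return client


def test_streaming_invokes_callback_per_chunk():
    client = make_prompt_client(["a", "b", "c"])
    callback = AsyncMock()
    result = asyncio.run(
        RagUnderTest(client).query("q", streaming=True, chunk_callback=callback)
    )
    assert result == "abc"
    assert [c.args[0] for c in callback.await_args_list] == ["a", "b", "c"]


def test_non_streaming_passes_no_streaming_arguments():
    client = make_prompt_client(["a", "b"])
    result = asyncio.run(RagUnderTest(client).query("q"))
    assert result == "ab"
    # The blocking path should call kg_prompt without streaming kwargs.
    assert client.kg_prompt.await_args.kwargs == {}
```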
**Integration tests**:
- Test full GraphRAG streaming flow (similar to existing agent streaming tests)
- Test full DocumentRAG streaming flow
- Test Gateway streaming forwarding
- Test CLI streaming output
**Manual testing**:
- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
- Verify incremental output appears in streaming mode
## Migration Plan
No migration needed:
- Streaming is opt-in via `streaming` parameter (defaults to False)
- Existing clients continue to work unchanged
- New clients can opt into streaming
## Timeline
Estimated implementation: 4-6 hours
- Phase 1 (2 hours): GraphRAG streaming support
- Phase 2 (2 hours): DocumentRAG streaming support
- Phase 3 (1-2 hours): Gateway updates and CLI flags
- Testing: Built into each phase
## Open Questions
- Should we add streaming support to NLP Query service as well?
- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?
## References
- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`