---
layout: default
title: RAG Streaming Support Technical Specification
parent: Tech Specs
---

# RAG Streaming Support Technical Specification

## Overview
This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.
## Goals

- Consistent streaming UX: Provide the same streaming experience across all TrustGraph services
- Minimal API changes: Add streaming support with a single `streaming` flag, following established patterns
- Backward compatibility: Maintain existing non-streaming behavior as default
- Reuse existing infrastructure: Leverage PromptClient streaming already implemented
- Gateway support: Enable streaming through the websocket gateway for client applications
## Background
Currently implemented streaming services:
- LLM text-completion service: Phase 1 - streaming from LLM providers
- Prompt service: Phase 2 - streaming through prompt templates
- Agent service: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks
Current limitations for RAG services:
- GraphRAG and DocumentRAG only support blocking responses
- Users must wait for complete LLM response before seeing any output
- Poor UX for long responses from knowledge graph or document queries
- Inconsistent experience compared to other TrustGraph services
This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:
- Provide consistent streaming UX across all query types
- Reduce perceived latency for RAG queries
- Enable better progress feedback for long-running queries
- Support real-time display in client applications
## Technical Design

### Architecture
The RAG streaming implementation leverages existing infrastructure:
1. PromptClient Streaming (already implemented)
   - `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
   - These call `prompt()` internally with streaming support
   - No changes needed to PromptClient

   Module: `trustgraph-base/trustgraph/base/prompt_client.py`

2. GraphRAG Service (needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
   - `GraphRagRequest` schema needs a `streaming` field

   Modules:
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)

3. DocumentRAG Service (needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.document_prompt()`
   - `DocumentRagRequest` schema needs a `streaming` field

   Modules:
   - `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)
### Data Flow

Non-streaming (current):

```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                          ↓
                                 Prompt Service → LLM
                                          ↓
                                 Complete response
                                          ↓
Client ← Gateway ← RAG Service ← Response
```

Streaming (proposed):

```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                          ↓
                                 Prompt Service → LLM (streaming)
                                          ↓
                                 Chunk → callback → RAG Response (chunk)
                                          ↓
Client ← Gateway ← ─────────────────────────────────── Response stream
```
### APIs

GraphRAG Changes:

- `GraphRag.query()` - Add streaming parameters

```python
async def query(
    self, query, user, collection,
    verbose=False, streaming=False, chunk_callback=None  # NEW
):
    # ... existing entity/triple retrieval ...

    if streaming and chunk_callback:
        resp = await self.prompt_client.kg_prompt(
            query, kg,
            streaming=True,
            chunk_callback=chunk_callback
        )
    else:
        resp = await self.prompt_client.kg_prompt(query, kg)

    return resp
```
- `GraphRagRequest` schema - Add streaming field

```python
class GraphRagRequest(Record):
    query = String()
    user = String()
    collection = String()
    streaming = Boolean()  # NEW
```
- `GraphRagResponse` schema - Add streaming fields (follow Agent pattern)

```python
class GraphRagResponse(Record):
    response = String()        # Legacy: complete response
    chunk = String()           # NEW: streaming chunk
    end_of_stream = Boolean()  # NEW: indicates last chunk
```
- Processor - Pass streaming through

```python
async def handle(self, msg):
    # ... existing code ...

    async def send_chunk(chunk):
        await self.respond(GraphRagResponse(
            chunk=chunk,
            end_of_stream=False,
            response=None
        ))

    if request.streaming:
        full_response = await self.rag.query(
            query=request.query,
            user=request.user,
            collection=request.collection,
            streaming=True,
            chunk_callback=send_chunk
        )
        # Send final message
        await self.respond(GraphRagResponse(
            chunk=None,
            end_of_stream=True,
            response=full_response
        ))
    else:
        # Existing non-streaming path
        response = await self.rag.query(...)
        await self.respond(GraphRagResponse(response=response))
```
DocumentRAG Changes:
Identical pattern to GraphRAG:
- Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
- Add `streaming` field to `DocumentRagRequest`
- Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
- Update Processor to handle streaming with callbacks
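The `DocumentRag.query()` change can be sketched as follows. This is a minimal stand-in, not the real class: the `docs` variable is a placeholder for the existing vector retrieval step, and the `document_prompt()` signature is taken from the PromptClient description above.

```python
class DocumentRag:
    """Minimal stand-in for the real service class, for illustration."""

    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def query(
        self, query, user, collection,
        streaming=False, chunk_callback=None  # NEW parameters
    ):
        # ... existing vector retrieval of document chunks ...
        docs = []  # placeholder for retrieved document context

        if streaming and chunk_callback:
            # Streaming: PromptClient invokes chunk_callback per LLM chunk
            return await self.prompt_client.document_prompt(
                query, docs,
                streaming=True,
                chunk_callback=chunk_callback,
            )

        # Non-streaming path unchanged
        return await self.prompt_client.document_prompt(query, docs)
```

Because the non-streaming branch is the unannotated fall-through, existing callers that never pass the new parameters are unaffected.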
Gateway Changes:
Both graph_rag.py and document_rag.py in gateway/dispatch need updates to forward streaming chunks to websocket:
```python
async def handle(self, message, session, websocket):
    # ... existing code ...

    if request.streaming:
        async def recipient(resp):
            # Forward chunk frames, and also the final end-of-stream
            # frame (chunk=None) so the client sees complete=True
            if resp.chunk or resp.end_of_stream:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk},
                    "complete": resp.end_of_stream
                }))
            return resp.end_of_stream

        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
```
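On the receiving end, a websocket client only needs to accumulate chunks until a frame arrives with `complete` set. A minimal sketch, with the frame shape taken from the gateway handler above (the function name is illustrative):

```python
import json

def consume_stream(frames):
    """Reassemble a streamed RAG response from gateway websocket frames.

    Each frame is a JSON string of the form:
    {"id": ..., "response": {"chunk": ...}, "complete": bool}
    """
    parts = []
    for raw in frames:
        msg = json.loads(raw)
        chunk = msg["response"].get("chunk")
        if chunk:
            parts.append(chunk)
        if msg["complete"]:
            break
    return "".join(parts)
```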
## Implementation Details
Implementation order:
- Add schema fields (Request + Response for both RAG services)
- Update GraphRag.query() and DocumentRag.query() methods
- Update Processors to handle streaming
- Update Gateway dispatch handlers
- Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following the agent CLI pattern)
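The CLI flag can follow the standard argparse negative-flag idiom. This is a sketch; the actual `tg-invoke-*` option wiring may differ:

```python
import argparse

def build_parser():
    # Streaming enabled by default; --no-streaming turns it off,
    # following the agent CLI pattern described above.
    parser = argparse.ArgumentParser(prog="tg-invoke-graph-rag")
    parser.add_argument("-q", "--query", required=True)
    parser.add_argument(
        "--no-streaming",
        dest="streaming",
        action="store_false",
        default=True,
        help="disable token-by-token streaming output",
    )
    return parser
```

With `store_false`, omitting the flag yields `args.streaming == True`, so existing invocations get streaming without any change.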
Callback pattern: Follow the same async callback pattern established in Agent streaming:
- Processor defines an `async def send_chunk(chunk)` callback
- Processor passes the callback to the RAG service
- RAG service passes callback to PromptClient
- PromptClient invokes callback for each LLM chunk
- Processor sends streaming response message for each chunk
Error handling:
- Errors during streaming should send an error response with `end_of_stream=True`
- Follow existing error propagation patterns from Agent streaming
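A sketch of what that error path could look like in the Processor. The dict-shaped messages and the `error` field are assumptions for illustration, not the actual schema:

```python
async def handle_streaming_query(run_query, respond):
    """Illustrative error path: any failure terminates the stream.

    run_query: coroutine performing the streaming RAG query
    respond:   coroutine sending a response message back to the client
    """
    try:
        full = await run_query()
        await respond({"response": full, "end_of_stream": True})
    except Exception as e:
        # On error mid-stream, send a terminal message so the client
        # stops waiting for further chunks
        await respond({"error": str(e), "end_of_stream": True})
```

The key property is that every execution path ends with exactly one `end_of_stream=True` message.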
## Security Considerations
No new security considerations beyond existing RAG services:
- Streaming responses use same user/collection isolation
- No changes to authentication or authorization
- Chunk boundaries don't expose sensitive data
## Performance Considerations
Benefits:
- Reduced perceived latency (first tokens arrive faster)
- Better UX for long responses
- Lower memory usage (no need to buffer complete response)
Potential concerns:
- More Pulsar messages for streaming responses
- Slightly higher CPU for chunking/callback overhead
- Mitigated by: streaming is opt-in, default remains non-streaming
Testing considerations:
- Test with large knowledge graphs (many triples)
- Test with many retrieved documents
- Measure overhead of streaming vs non-streaming
## Testing Strategy
Unit tests:
- Test GraphRag.query() with streaming=True/False
- Test DocumentRag.query() with streaming=True/False
- Mock PromptClient to verify callback invocations
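The mocked-PromptClient test can be sketched like this. `GraphRag` here is a minimal stand-in mirroring the `query()` change described above, and the chunk text is arbitrary:

```python
import asyncio
from unittest.mock import AsyncMock

class FakePromptClient:
    """Stands in for PromptClient; emits chunks, then the full text."""
    async def kg_prompt(self, query, kg, streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            for tok in ("Graphs ", "rule"):
                await chunk_callback(tok)
        return "Graphs rule"

class GraphRag:
    """Minimal stand-in mirroring the query() change, for the test only."""
    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def query(self, query, user, collection,
                    streaming=False, chunk_callback=None):
        kg = []  # placeholder for entity/triple retrieval
        if streaming and chunk_callback:
            return await self.prompt_client.kg_prompt(
                query, kg, streaming=True, chunk_callback=chunk_callback)
        return await self.prompt_client.kg_prompt(query, kg)

def test_streaming_invokes_callback():
    cb = AsyncMock()
    rag = GraphRag(FakePromptClient())
    result = asyncio.run(
        rag.query("q", "user", "coll", streaming=True, chunk_callback=cb))
    assert result == "Graphs rule"
    assert cb.await_count == 2

def test_non_streaming_skips_callback():
    cb = AsyncMock()
    rag = GraphRag(FakePromptClient())
    result = asyncio.run(rag.query("q", "user", "coll"))
    assert result == "Graphs rule"
    cb.assert_not_awaited()
```

`AsyncMock` records each `await`, so `await_count` verifies one callback invocation per chunk without any real Pulsar or LLM traffic.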
Integration tests:
- Test full GraphRAG streaming flow (similar to existing agent streaming tests)
- Test full DocumentRAG streaming flow
- Test Gateway streaming forwarding
- Test CLI streaming output
Manual testing:
- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
- Verify incremental output appears in streaming mode
## Migration Plan

No migration needed:
- Streaming is opt-in via the `streaming` parameter (defaults to False)
- Existing clients continue to work unchanged
- New clients can opt into streaming
## Timeline
Estimated implementation: 4-6 hours
- Phase 1 (2 hours): GraphRAG streaming support
- Phase 2 (2 hours): DocumentRAG streaming support
- Phase 3 (1-2 hours): Gateway updates and CLI flags
- Testing: Built into each phase
## Open Questions
- Should we add streaming support to NLP Query service as well?
- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?
## References

- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`