---
layout: default
title: "RAG Streaming Support Technical Specification"
parent: "Tech Specs"
---

# RAG Streaming Support Technical Specification

## Overview

This specification describes adding streaming support to the GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. It extends the existing streaming architecture already implemented for the LLM text-completion, prompt, and agent services.

## Goals

- **Consistent streaming UX**: Provide the same streaming experience across all TrustGraph services
- **Minimal API changes**: Add streaming support with a single `streaming` flag, following established patterns
- **Backward compatibility**: Maintain existing non-streaming behavior as the default
- **Reuse existing infrastructure**: Leverage the PromptClient streaming already implemented
- **Gateway support**: Enable streaming through the websocket gateway for client applications

## Background

Currently implemented streaming services:

- **LLM text-completion service**: Phase 1 - streaming from LLM providers
- **Prompt service**: Phase 2 - streaming through prompt templates
- **Agent service**: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks

Current limitations for RAG services:

- GraphRAG and DocumentRAG only support blocking responses
- Users must wait for the complete LLM response before seeing any output
- Poor UX for long responses from knowledge graph or document queries
- Inconsistent experience compared to other TrustGraph services

This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:

- Provide consistent streaming UX across all query types
- Reduce perceived latency for RAG queries
- Enable better progress feedback for long-running queries
- Support real-time display in client applications

## Technical Design

### Architecture

The RAG streaming implementation leverages existing infrastructure:

1. **PromptClient Streaming** (Already implemented)
   - `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
   - These call `prompt()` internally with streaming support
   - No changes needed to PromptClient; see the sketch after this list

   Module: `trustgraph-base/trustgraph/base/prompt_client.py`

2. **GraphRAG Service** (Needs streaming parameter pass-through)
   - Add a `streaming` parameter to the `query()` method
   - Pass the streaming flag and callback to `prompt_client.kg_prompt()`
   - The `GraphRagRequest` schema needs a `streaming` field

   Modules:
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)

3. **DocumentRAG Service** (Needs streaming parameter pass-through)
   - Add a `streaming` parameter to the `query()` method
   - Pass the streaming flag and callback to `prompt_client.document_prompt()`
   - The `DocumentRagRequest` schema needs a `streaming` field

   Modules:
   - `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)
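As an illustration of the existing PromptClient interface described in item 1, here is a minimal sketch of a streaming call. The `kg_prompt()` signature is taken from this spec's own usage; the `stream_kg_answer` wrapper and the `on_chunk` callback body are purely illustrative.

```python
# Sketch only: demonstrates the existing PromptClient streaming
# interface (kg_prompt with streaming + chunk_callback). The wrapper
# function and callback body are illustrative, not part of the spec.

async def stream_kg_answer(prompt_client, query, kg):
    async def on_chunk(chunk):
        # Invoked once per LLM chunk as it arrives.
        print(chunk, end="", flush=True)

    # Per this spec, kg_prompt() still returns the complete response
    # once the stream has finished.
    return await prompt_client.kg_prompt(
        query, kg, streaming=True, chunk_callback=on_chunk
    )
```
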
### Data Flow

**Non-streaming (current)**:

```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                          ↓
                                   Prompt Service → LLM
                                          ↓
                                   Complete response
                                          ↓
Client ← Gateway ← RAG Service ← Response
```

**Streaming (proposed)**:

```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                          ↓
                                   Prompt Service → LLM (streaming)
                                          ↓
                                   Chunk → callback → RAG Response (chunk)
   ↓                                                        ↓
Client ← Gateway ← ──────────────────────────────── Response stream
```

### APIs

**GraphRAG Changes**:

1. **GraphRag.query()** - Add streaming parameters

   ```python
   async def query(
       self, query, user, collection, verbose=False,
       streaming=False, chunk_callback=None  # NEW
   ):
       # ... existing entity/triple retrieval ...

       if streaming and chunk_callback:
           resp = await self.prompt_client.kg_prompt(
               query, kg, streaming=True, chunk_callback=chunk_callback
           )
       else:
           resp = await self.prompt_client.kg_prompt(query, kg)

       return resp
   ```

2. **GraphRagRequest schema** - Add streaming field

   ```python
   class GraphRagRequest(Record):
       query = String()
       user = String()
       collection = String()
       streaming = Boolean()  # NEW
   ```

3. **GraphRagResponse schema** - Add streaming fields (follow the Agent pattern)

   ```python
   class GraphRagResponse(Record):
       response = String()        # Legacy: complete response
       chunk = String()           # NEW: streaming chunk
       end_of_stream = Boolean()  # NEW: indicates last chunk
   ```

4. **Processor** - Pass streaming through

   ```python
   async def handle(self, msg):
       # ... existing code ...

       async def send_chunk(chunk):
           await self.respond(GraphRagResponse(
               chunk=chunk, end_of_stream=False, response=None
           ))

       if request.streaming:
           full_response = await self.rag.query(
               query=request.query,
               user=request.user,
               collection=request.collection,
               streaming=True,
               chunk_callback=send_chunk
           )
           # Send final message
           await self.respond(GraphRagResponse(
               chunk=None, end_of_stream=True, response=full_response
           ))
       else:
           # Existing non-streaming path
           response = await self.rag.query(...)
           await self.respond(GraphRagResponse(response=response))
   ```

**DocumentRAG Changes**:

Identical pattern to GraphRAG:

1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
2. Add a `streaming` field to `DocumentRagRequest`
3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
4. Update the Processor to handle streaming with callbacks

**Gateway Changes**:

Both `graph_rag.py` and `document_rag.py` in gateway/dispatch need updates to forward streaming chunks to the websocket:

```python
async def handle(self, message, session, websocket):
    # ... existing code ...

    if request.streaming:
        async def recipient(resp):
            if resp.chunk:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk},
                    "complete": resp.end_of_stream
                }))
            return resp.end_of_stream

        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
```

### Implementation Details

**Implementation order**:

1. Add schema fields (Request + Response for both RAG services)
2. Update the `GraphRag.query()` and `DocumentRag.query()` methods
3. Update the Processors to handle streaming
4. Update the Gateway dispatch handlers
5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following the agent CLI pattern)

**Callback pattern**:

Follow the same async callback pattern established in Agent streaming (a sketch of the full chain follows the error-handling notes below):

- The Processor defines an `async def send_chunk(chunk)` callback
- The Processor passes the callback to the RAG service
- The RAG service passes the callback to PromptClient
- PromptClient invokes the callback for each LLM chunk
- The Processor sends a streaming response message for each chunk

**Error handling**:

- Errors during streaming should send an error response with `end_of_stream=True` (a second sketch below illustrates this)
- Follow existing error propagation patterns from Agent streaming
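To make the callback chain concrete, here is a minimal, runnable sketch of the plumbing described above. `FakePromptClient` and `FakeGraphRag` are stand-ins invented for this example; only the `streaming`/`chunk_callback` pass-through mirrors the spec.

```python
# Minimal sketch of the callback chain: Processor → RAG service →
# PromptClient → LLM chunks → Processor responses. The Fake* classes
# are stand-ins; only the chunk_callback plumbing is the point here.
import asyncio

class FakePromptClient:
    async def kg_prompt(self, query, kg, streaming=False, chunk_callback=None):
        chunks = ["Machine ", "learning ", "is..."]
        if streaming and chunk_callback:
            for c in chunks:
                await chunk_callback(c)  # one callback per LLM chunk
        return "".join(chunks)           # complete response at the end

class FakeGraphRag:
    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def query(self, query, kg, streaming=False, chunk_callback=None):
        # The RAG service just forwards the flag and callback.
        return await self.prompt_client.kg_prompt(
            query, kg, streaming=streaming, chunk_callback=chunk_callback
        )

async def main():
    rag = FakeGraphRag(FakePromptClient())

    async def send_chunk(chunk):
        # In the real Processor this is self.respond(GraphRagResponse(...)).
        print(f"chunk: {chunk!r}")

    full = await rag.query(
        "What is ML?", kg=[], streaming=True, chunk_callback=send_chunk
    )
    print(f"final: {full!r}")

asyncio.run(main())
```
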
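For the error path, a hedged sketch of how the Processor might close the stream on failure. The `error` field on `GraphRagResponse` is an assumption made for illustration; this spec does not define an error schema, so the real implementation should follow the Agent streaming error pattern.

```python
# Sketch only: assumes GraphRagResponse carries an error field, which
# this spec does not define; follow the Agent streaming error schema
# in the real implementation. `respond` is the Processor's response
# sender, `rag` the RAG service, `request` the decoded request.
async def handle_streaming(rag, request, respond):
    async def send_chunk(chunk):
        await respond(GraphRagResponse(
            chunk=chunk, end_of_stream=False, response=None
        ))

    try:
        full = await rag.query(
            query=request.query, user=request.user,
            collection=request.collection,
            streaming=True, chunk_callback=send_chunk
        )
        await respond(GraphRagResponse(
            chunk=None, end_of_stream=True, response=full
        ))
    except Exception as e:
        # Close the stream so clients are not left waiting for chunks.
        await respond(GraphRagResponse(
            error=str(e),  # hypothetical field; see note above
            chunk=None, end_of_stream=True, response=None
        ))
```
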
## Security Considerations

No new security considerations beyond the existing RAG services:

- Streaming responses use the same user/collection isolation
- No changes to authentication or authorization
- Chunk boundaries don't expose sensitive data

## Performance Considerations

**Benefits**:

- Reduced perceived latency (first tokens arrive sooner)
- Better UX for long responses
- Lower memory usage (no need to buffer the complete response)

**Potential concerns**:

- More Pulsar messages for streaming responses
- Slightly higher CPU for chunking/callback overhead
- Mitigated by: streaming is opt-in; the default remains non-streaming

**Testing considerations**:

- Test with large knowledge graphs (many triples)
- Test with many retrieved documents
- Measure the overhead of streaming vs non-streaming

## Testing Strategy

**Unit tests**:

- Test `GraphRag.query()` with `streaming=True/False`
- Test `DocumentRag.query()` with `streaming=True/False`
- Mock PromptClient to verify callback invocations

**Integration tests**:

- Test the full GraphRAG streaming flow (similar to the existing agent streaming tests)
- Test the full DocumentRAG streaming flow
- Test Gateway streaming forwarding
- Test CLI streaming output

**Manual testing**:

- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
- Verify that incremental output appears in streaming mode

## Migration Plan

No migration needed:

- Streaming is opt-in at the API level via the `streaming` parameter (defaults to False); the CLI tools enable it by default but provide `--no-streaming`
- Existing clients continue to work unchanged
- New clients can opt into streaming

## Timeline

Estimated implementation: 4-6 hours

- Phase 1 (2 hours): GraphRAG streaming support
- Phase 2 (2 hours): DocumentRAG streaming support
- Phase 3 (1-2 hours): Gateway updates and CLI flags
- Testing: built into each phase

## Open Questions

- Should we add streaming support to the NLP Query service as well?
- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?

## References

- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`