Streaming rag responses (#568)

* Tech spec for streaming RAG * Support for streaming Graph/Doc RAG
2026-04-25 00:16:23 +02:00 · 2025-11-26 19:47:39 +00:00 · 2025-11-26 19:47:39 +00:00 · 1948edaa50
commit 1948edaa50
parent b1cc724f7d
20 changed files with 3087 additions and 94 deletions
--- a/docs/tech-specs/rag-streaming-support.md
+++ b/docs/tech-specs/rag-streaming-support.md
@ -0,0 +1,288 @@
+# RAG Streaming Support Technical Specification
+
+## Overview
+
+This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.
+
+## Goals
+
+- **Consistent streaming UX**: Provide the same streaming experience across all TrustGraph services
+- **Minimal API changes**: Add streaming support with a single `streaming` flag, following established patterns
+- **Backward compatibility**: Maintain existing non-streaming behavior as default
+- **Reuse existing infrastructure**: Leverage PromptClient streaming already implemented
+- **Gateway support**: Enable streaming through websocket gateway for client applications
+
+## Background
+
+Currently implemented streaming services:
+- **LLM text-completion service**: Phase 1 - streaming from LLM providers
+- **Prompt service**: Phase 2 - streaming through prompt templates
+- **Agent service**: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks
+
+Current limitations for RAG services:
+- GraphRAG and DocumentRAG only support blocking responses
+- Users must wait for complete LLM response before seeing any output
+- Poor UX for long responses from knowledge graph or document queries
+- Inconsistent experience compared to other TrustGraph services
+
+This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:
+- Provide consistent streaming UX across all query types
+- Reduce perceived latency for RAG queries
+- Enable better progress feedback for long-running queries
+- Support real-time display in client applications
+
+## Technical Design
+
+### Architecture
+
+The RAG streaming implementation leverages existing infrastructure:
+
+1. **PromptClient Streaming** (Already implemented)
+   - `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
+   - These call `prompt()` internally with streaming support
+   - No changes needed to PromptClient
+
+   Module: `trustgraph-base/trustgraph/base/prompt_client.py`
+
+2. **GraphRAG Service** (Needs streaming parameter pass-through)
+   - Add `streaming` parameter to `query()` method
+   - Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
+   - GraphRagRequest schema needs `streaming` field
+
+   Modules:
+   - `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
+   - `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
+   - `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
+   - `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)
+
+3. **DocumentRAG Service** (Needs streaming parameter pass-through)
+   - Add `streaming` parameter to `query()` method
+   - Pass streaming flag and callbacks to `prompt_client.document_prompt()`
+   - DocumentRagRequest schema needs `streaming` field
+
+   Modules:
+   - `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
+   - `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
+   - `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
+   - `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)
+
+### Data Flow
+
+**Non-streaming (current)**:
+```
+Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
+                                   ↓
+                                Prompt Service → LLM
+                                   ↓
+                                Complete response
+                                   ↓
+Client ← Gateway ← RAG Service ←  Response
+```
+
+**Streaming (proposed)**:
+```
+Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
+                                   ↓
+                                Prompt Service → LLM (streaming)
+                                   ↓
+                                Chunk → callback → RAG Response (chunk)
+                                   ↓                       ↓
+Client ← Gateway ← ────────────────────────────────── Response stream
+```
+
+### APIs
+
+**GraphRAG Changes**:
+
+1. **GraphRag.query()** - Add streaming parameters
+```python
+async def query(
+    self, query, user, collection,
+    verbose=False, streaming=False, chunk_callback=None  # NEW
+):
+    # ... existing entity/triple retrieval ...
+
+    if streaming and chunk_callback:
+        resp = await self.prompt_client.kg_prompt(
+            query, kg,
+            streaming=True,
+            chunk_callback=chunk_callback
+        )
+    else:
+        resp = await self.prompt_client.kg_prompt(query, kg)
+
+    return resp
+```
+
+2. **GraphRagRequest schema** - Add streaming field
+```python
+class GraphRagRequest(Record):
+    query = String()
+    user = String()
+    collection = String()
+    streaming = Boolean()  # NEW
+```
+
+3. **GraphRagResponse schema** - Add streaming fields (follow Agent pattern)
+```python
+class GraphRagResponse(Record):
+    response = String()    # Legacy: complete response
+    chunk = String()       # NEW: streaming chunk
+    end_of_stream = Boolean()  # NEW: indicates last chunk
+```
+
+4. **Processor** - Pass streaming through
+```python
+async def handle(self, msg):
+    # ... existing code ...
+
+    async def send_chunk(chunk):
+        await self.respond(GraphRagResponse(
+            chunk=chunk,
+            end_of_stream=False,
+            response=None
+        ))
+
+    if request.streaming:
+        full_response = await self.rag.query(
+            query=request.query,
+            user=request.user,
+            collection=request.collection,
+            streaming=True,
+            chunk_callback=send_chunk
+        )
+        # Send final message
+        await self.respond(GraphRagResponse(
+            chunk=None,
+            end_of_stream=True,
+            response=full_response
+        ))
+    else:
+        # Existing non-streaming path
+        response = await self.rag.query(...)
+        await self.respond(GraphRagResponse(response=response))
+```
+
+**DocumentRAG Changes**:
+
+Identical pattern to GraphRAG:
+1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
+2. Add `streaming` field to `DocumentRagRequest`
+3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
+4. Update Processor to handle streaming with callbacks
+
+**Gateway Changes**:
+
+Both `graph_rag.py` and `document_rag.py` in gateway/dispatch need updates to forward streaming chunks to websocket:
+
+```python
+async def handle(self, message, session, websocket):
+    # ... existing code ...
+
+    if request.streaming:
+        async def recipient(resp):
+            if resp.chunk:
+                await websocket.send(json.dumps({
+                    "id": message["id"],
+                    "response": {"chunk": resp.chunk},
+                    "complete": resp.end_of_stream
+                }))
+            return resp.end_of_stream
+
+        await self.rag_client.request(request, recipient=recipient)
+    else:
+        # Existing non-streaming path
+        resp = await self.rag_client.request(request)
+        await websocket.send(...)
+```
+
+### Implementation Details
+
+**Implementation order**:
+1. Add schema fields (Request + Response for both RAG services)
+2. Update GraphRag.query() and DocumentRag.query() methods
+3. Update Processors to handle streaming
+4. Update Gateway dispatch handlers
+5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following agent CLI pattern)
+
+**Callback pattern**:
+Follow the same async callback pattern established in Agent streaming:
+- Processor defines `async def send_chunk(chunk)` callback
+- Passes callback to RAG service
+- RAG service passes callback to PromptClient
+- PromptClient invokes callback for each LLM chunk
+- Processor sends streaming response message for each chunk
+
+**Error handling**:
+- Errors during streaming should send error response with `end_of_stream=True`
+- Follow existing error propagation patterns from Agent streaming
+
+## Security Considerations
+
+No new security considerations beyond existing RAG services:
+- Streaming responses use same user/collection isolation
+- No changes to authentication or authorization
+- Chunk boundaries don't expose sensitive data
+
+## Performance Considerations
+
+**Benefits**:
+- Reduced perceived latency (first tokens arrive faster)
+- Better UX for long responses
+- Lower memory usage (no need to buffer complete response)
+
+**Potential concerns**:
+- More Pulsar messages for streaming responses
+- Slightly higher CPU for chunking/callback overhead
+- Mitigated by: streaming is opt-in, default remains non-streaming
+
+**Testing considerations**:
+- Test with large knowledge graphs (many triples)
+- Test with many retrieved documents
+- Measure overhead of streaming vs non-streaming
+
+## Testing Strategy
+
+**Unit tests**:
+- Test GraphRag.query() with streaming=True/False
+- Test DocumentRag.query() with streaming=True/False
+- Mock PromptClient to verify callback invocations
+
+**Integration tests**:
+- Test full GraphRAG streaming flow (similar to existing agent streaming tests)
+- Test full DocumentRAG streaming flow
+- Test Gateway streaming forwarding
+- Test CLI streaming output
+
+**Manual testing**:
+- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
+- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
+- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
+- Verify incremental output appears in streaming mode
+
+## Migration Plan
+
+No migration needed:
+- Streaming is opt-in via `streaming` parameter (defaults to False)
+- Existing clients continue to work unchanged
+- New clients can opt into streaming
+
+## Timeline
+
+Estimated implementation: 4-6 hours
+- Phase 1 (2 hours): GraphRAG streaming support
+- Phase 2 (2 hours): DocumentRAG streaming support
+- Phase 3 (1-2 hours): Gateway updates and CLI flags
+- Testing: Built into each phase
+
+## Open Questions
+
+- Should we add streaming support to NLP Query service as well?
+- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
+- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?
+
+## References
+
+- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
+- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
+- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`