trustgraph/docs/tech-specs/rag-streaming-support.md

---
layout: default
title: "RAG Streaming Support Technical Specification"
parent: "Tech Specs"
---

# RAG Streaming Support Technical Specification

## Overview

This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.

## Goals

- **Consistent streaming UX**: Provide the same streaming experience across all TrustGraph services
- **Minimal API changes**: Add streaming support with a single `streaming` flag, following established patterns
- **Backward compatibility**: Maintain existing non-streaming behavior as default
- **Reuse existing infrastructure**: Leverage PromptClient streaming already implemented
- **Gateway support**: Enable streaming through websocket gateway for client applications

## Background

Currently implemented streaming services:
- **LLM text-completion service**: Phase 1 - streaming from LLM providers
- **Prompt service**: Phase 2 - streaming through prompt templates
- **Agent service**: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks

Current limitations for RAG services:
- GraphRAG and DocumentRAG only support blocking responses
- Users must wait for complete LLM response before seeing any output
- Poor UX for long responses from knowledge graph or document queries
- Inconsistent experience compared to other TrustGraph services

This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:
- Provide consistent streaming UX across all query types
- Reduce perceived latency for RAG queries
- Enable better progress feedback for long-running queries
- Support real-time display in client applications

## Technical Design

### Architecture

The RAG streaming implementation leverages existing infrastructure:

1. **PromptClient Streaming** (Already implemented)
   - `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
   - These call `prompt()` internally with streaming support
   - No changes needed to PromptClient

   Module: `trustgraph-base/trustgraph/base/prompt_client.py`

2. **GraphRAG Service** (Needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
   - GraphRagRequest schema needs `streaming` field

   Modules:
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)

3. **DocumentRAG Service** (Needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.document_prompt()`
   - DocumentRagRequest schema needs `streaming` field

   Modules:
   - `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)

### Data Flow

**Non-streaming (current)**:
```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                   ↓
                                Prompt Service → LLM
                                   ↓
                                Complete response
                                   ↓
Client ← Gateway ← RAG Service ←  Response
```

**Streaming (proposed)**:
```
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                   ↓
                                Prompt Service → LLM (streaming)
                                   ↓
                                Chunk → callback → RAG Response (chunk)
                                   ↓                       ↓
Client ← Gateway ← ────────────────────────────────── Response stream
```

### APIs

**GraphRAG Changes**:

1. **GraphRag.query()** - Add streaming parameters
```python
async def query(
    self, query, user, collection,
    verbose=False, streaming=False, chunk_callback=None  # NEW
):
    # ... existing entity/triple retrieval ...

    if streaming and chunk_callback:
        resp = await self.prompt_client.kg_prompt(
            query, kg,
            streaming=True,
            chunk_callback=chunk_callback
        )
    else:
        resp = await self.prompt_client.kg_prompt(query, kg)

    return resp
```

2. **GraphRagRequest schema** - Add streaming field
```python
class GraphRagRequest(Record):
    query = String()
    user = String()
    collection = String()
    streaming = Boolean()  # NEW
```

3. **GraphRagResponse schema** - Add streaming fields (follow Agent pattern)
```python
class GraphRagResponse(Record):
    response = String()    # Legacy: complete response
    chunk = String()       # NEW: streaming chunk
    end_of_stream = Boolean()  # NEW: indicates last chunk
```

4. **Processor** - Pass streaming through
```python
async def handle(self, msg):
    # ... existing code ...

    async def send_chunk(chunk):
        await self.respond(GraphRagResponse(
            chunk=chunk,
            end_of_stream=False,
            response=None
        ))

    if request.streaming:
        full_response = await self.rag.query(
            query=request.query,
            user=request.user,
            collection=request.collection,
            streaming=True,
            chunk_callback=send_chunk
        )
        # Send final message
        await self.respond(GraphRagResponse(
            chunk=None,
            end_of_stream=True,
            response=full_response
        ))
    else:
        # Existing non-streaming path
        response = await self.rag.query(...)
        await self.respond(GraphRagResponse(response=response))
```

**DocumentRAG Changes**:

Identical pattern to GraphRAG:
1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
2. Add `streaming` field to `DocumentRagRequest`
3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
4. Update Processor to handle streaming with callbacks

**Gateway Changes**:

Both `graph_rag.py` and `document_rag.py` in gateway/dispatch need updates to forward streaming chunks to websocket:

```python
async def handle(self, message, session, websocket):
    # ... existing code ...

    if request.streaming:
        async def recipient(resp):
            if resp.chunk:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk},
                    "complete": resp.end_of_stream
                }))
            return resp.end_of_stream

        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
```

### Implementation Details

**Implementation order**:
1. Add schema fields (Request + Response for both RAG services)
2. Update GraphRag.query() and DocumentRag.query() methods
3. Update Processors to handle streaming
4. Update Gateway dispatch handlers
5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following agent CLI pattern)

**Callback pattern**:
Follow the same async callback pattern established in Agent streaming:
- Processor defines `async def send_chunk(chunk)` callback
- Passes callback to RAG service
- RAG service passes callback to PromptClient
- PromptClient invokes callback for each LLM chunk
- Processor sends streaming response message for each chunk

**Error handling**:
- Errors during streaming should send error response with `end_of_stream=True`
- Follow existing error propagation patterns from Agent streaming

## Security Considerations

No new security considerations beyond existing RAG services:
- Streaming responses use same user/collection isolation
- No changes to authentication or authorization
- Chunk boundaries don't expose sensitive data

## Performance Considerations

**Benefits**:
- Reduced perceived latency (first tokens arrive faster)
- Better UX for long responses
- Lower memory usage (no need to buffer complete response)

**Potential concerns**:
- More Pulsar messages for streaming responses
- Slightly higher CPU for chunking/callback overhead
- Mitigated by: streaming is opt-in, default remains non-streaming

**Testing considerations**:
- Test with large knowledge graphs (many triples)
- Test with many retrieved documents
- Measure overhead of streaming vs non-streaming

## Testing Strategy

**Unit tests**:
- Test GraphRag.query() with streaming=True/False
- Test DocumentRag.query() with streaming=True/False
- Mock PromptClient to verify callback invocations

**Integration tests**:
- Test full GraphRAG streaming flow (similar to existing agent streaming tests)
- Test full DocumentRAG streaming flow
- Test Gateway streaming forwarding
- Test CLI streaming output

**Manual testing**:
- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
- Verify incremental output appears in streaming mode

## Migration Plan

No migration needed:
- Streaming is opt-in via `streaming` parameter (defaults to False)
- Existing clients continue to work unchanged
- New clients can opt into streaming

## Timeline

Estimated implementation: 4-6 hours
- Phase 1 (2 hours): GraphRAG streaming support
- Phase 2 (2 hours): DocumentRAG streaming support
- Phase 3 (1-2 hours): Gateway updates and CLI flags
- Testing: Built into each phase

## Open Questions

- Should we add streaming support to NLP Query service as well?
- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?

## References

- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`
Feat: TrustGraph i18n & Documentation Translation Updates (#781) Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian. 2026-04-14 07:07:58 -04:00			`---`
			`layout: default`
			`title: "RAG Streaming Support Technical Specification"`
			`parent: "Tech Specs"`
			`---`

Streaming rag responses (#568) * Tech spec for streaming RAG * Support for streaming Graph/Doc RAG 2025-11-26 19:47:39 +00:00			`# RAG Streaming Support Technical Specification`

			`## Overview`

			`This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.`

			`## Goals`

			`- Consistent streaming UX: Provide the same streaming experience across all TrustGraph services`
			- Minimal API changes: Add streaming support with a single `streaming` flag, following established patterns
			`- Backward compatibility: Maintain existing non-streaming behavior as default`
			`- Reuse existing infrastructure: Leverage PromptClient streaming already implemented`
			`- Gateway support: Enable streaming through websocket gateway for client applications`

			`## Background`

			`Currently implemented streaming services:`
			`- LLM text-completion service: Phase 1 - streaming from LLM providers`
			`- Prompt service: Phase 2 - streaming through prompt templates`
			`- Agent service: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks`

			`Current limitations for RAG services:`
			`- GraphRAG and DocumentRAG only support blocking responses`
			`- Users must wait for complete LLM response before seeing any output`
			`- Poor UX for long responses from knowledge graph or document queries`
			`- Inconsistent experience compared to other TrustGraph services`

			`This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:`
			`- Provide consistent streaming UX across all query types`
			`- Reduce perceived latency for RAG queries`
			`- Enable better progress feedback for long-running queries`
			`- Support real-time display in client applications`

			`## Technical Design`

			`### Architecture`

			`The RAG streaming implementation leverages existing infrastructure:`

			`1. PromptClient Streaming (Already implemented)`
			- `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
			- These call `prompt()` internally with streaming support
			`- No changes needed to PromptClient`

			Module: `trustgraph-base/trustgraph/base/prompt_client.py`

			`2. GraphRAG Service (Needs streaming parameter pass-through)`
			- Add `streaming` parameter to `query()` method
			- Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
			- GraphRagRequest schema needs `streaming` field

			`Modules:`
			- `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
			- `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
			- `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
			- `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)

			`3. DocumentRAG Service (Needs streaming parameter pass-through)`
			- Add `streaming` parameter to `query()` method
			- Pass streaming flag and callbacks to `prompt_client.document_prompt()`
			- DocumentRagRequest schema needs `streaming` field

			`Modules:`
			- `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
			- `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
			- `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
			- `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)

			`### Data Flow`

			`Non-streaming (current):`
			```
			`Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)`
			`↓`
			`Prompt Service → LLM`
			`↓`
			`Complete response`
			`↓`
			`Client ← Gateway ← RAG Service ← Response`
			```

			`Streaming (proposed):`
			```
			`Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)`
			`↓`
			`Prompt Service → LLM (streaming)`
			`↓`
			`Chunk → callback → RAG Response (chunk)`
			`↓ ↓`
			`Client ← Gateway ← ────────────────────────────────── Response stream`
			```

			`### APIs`

			`GraphRAG Changes:`

			`1. GraphRag.query() - Add streaming parameters`
			```python
			`async def query(`
			`self, query, user, collection,`
			`verbose=False, streaming=False, chunk_callback=None # NEW`
			`):`
			`# ... existing entity/triple retrieval ...`

			`if streaming and chunk_callback:`
			`resp = await self.prompt_client.kg_prompt(`
			`query, kg,`
			`streaming=True,`
			`chunk_callback=chunk_callback`
			`)`
			`else:`
			`resp = await self.prompt_client.kg_prompt(query, kg)`

			`return resp`
			```

			`2. GraphRagRequest schema - Add streaming field`
			```python
			`class GraphRagRequest(Record):`
			`query = String()`
			`user = String()`
			`collection = String()`
			`streaming = Boolean() # NEW`
			```

			`3. GraphRagResponse schema - Add streaming fields (follow Agent pattern)`
			```python
			`class GraphRagResponse(Record):`
			`response = String() # Legacy: complete response`
			`chunk = String() # NEW: streaming chunk`
			`end_of_stream = Boolean() # NEW: indicates last chunk`
			```

			`4. Processor - Pass streaming through`
			```python
			`async def handle(self, msg):`
			`# ... existing code ...`

			`async def send_chunk(chunk):`
			`await self.respond(GraphRagResponse(`
			`chunk=chunk,`
			`end_of_stream=False,`
			`response=None`
			`))`

			`if request.streaming:`
			`full_response = await self.rag.query(`
			`query=request.query,`
			`user=request.user,`
			`collection=request.collection,`
			`streaming=True,`
			`chunk_callback=send_chunk`
			`)`
			`# Send final message`
			`await self.respond(GraphRagResponse(`
			`chunk=None,`
			`end_of_stream=True,`
			`response=full_response`
			`))`
			`else:`
			`# Existing non-streaming path`
			`response = await self.rag.query(...)`
			`await self.respond(GraphRagResponse(response=response))`
			```

			`DocumentRAG Changes:`

			`Identical pattern to GraphRAG:`
			1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
			2. Add `streaming` field to `DocumentRagRequest`
			3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
			`4. Update Processor to handle streaming with callbacks`

			`Gateway Changes:`

			Both `graph_rag.py` and `document_rag.py` in gateway/dispatch need updates to forward streaming chunks to websocket:

			```python
			`async def handle(self, message, session, websocket):`
			`# ... existing code ...`

			`if request.streaming:`
			`async def recipient(resp):`
			`if resp.chunk:`
			`await websocket.send(json.dumps({`
			`"id": message["id"],`
			`"response": {"chunk": resp.chunk},`
			`"complete": resp.end_of_stream`
			`}))`
			`return resp.end_of_stream`

			`await self.rag_client.request(request, recipient=recipient)`
			`else:`
			`# Existing non-streaming path`
			`resp = await self.rag_client.request(request)`
			`await websocket.send(...)`
			```

			`### Implementation Details`

			`Implementation order:`
			`1. Add schema fields (Request + Response for both RAG services)`
			`2. Update GraphRag.query() and DocumentRag.query() methods`
			`3. Update Processors to handle streaming`
			`4. Update Gateway dispatch handlers`
			5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following agent CLI pattern)

			`Callback pattern:`
			`Follow the same async callback pattern established in Agent streaming:`
			- Processor defines `async def send_chunk(chunk)` callback
			`- Passes callback to RAG service`
			`- RAG service passes callback to PromptClient`
			`- PromptClient invokes callback for each LLM chunk`
			`- Processor sends streaming response message for each chunk`

			`Error handling:`
			- Errors during streaming should send error response with `end_of_stream=True`
			`- Follow existing error propagation patterns from Agent streaming`

			`## Security Considerations`

			`No new security considerations beyond existing RAG services:`
			`- Streaming responses use same user/collection isolation`
			`- No changes to authentication or authorization`
			`- Chunk boundaries don't expose sensitive data`

			`## Performance Considerations`

			`Benefits:`
			`- Reduced perceived latency (first tokens arrive faster)`
			`- Better UX for long responses`
			`- Lower memory usage (no need to buffer complete response)`

			`Potential concerns:`
			`- More Pulsar messages for streaming responses`
			`- Slightly higher CPU for chunking/callback overhead`
			`- Mitigated by: streaming is opt-in, default remains non-streaming`

			`Testing considerations:`
			`- Test with large knowledge graphs (many triples)`
			`- Test with many retrieved documents`
			`- Measure overhead of streaming vs non-streaming`

			`## Testing Strategy`

			`Unit tests:`
			`- Test GraphRag.query() with streaming=True/False`
			`- Test DocumentRag.query() with streaming=True/False`
			`- Mock PromptClient to verify callback invocations`

			`Integration tests:`
			`- Test full GraphRAG streaming flow (similar to existing agent streaming tests)`
			`- Test full DocumentRAG streaming flow`
			`- Test Gateway streaming forwarding`
			`- Test CLI streaming output`

			`Manual testing:`
			- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
			- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
			- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
			`- Verify incremental output appears in streaming mode`

			`## Migration Plan`

			`No migration needed:`
			- Streaming is opt-in via `streaming` parameter (defaults to False)
			`- Existing clients continue to work unchanged`
			`- New clients can opt into streaming`

			`## Timeline`

			`Estimated implementation: 4-6 hours`
			`- Phase 1 (2 hours): GraphRAG streaming support`
			`- Phase 2 (2 hours): DocumentRAG streaming support`
			`- Phase 3 (1-2 hours): Gateway updates and CLI flags`
			`- Testing: Built into each phase`

			`## Open Questions`

			`- Should we add streaming support to NLP Query service as well?`
			`- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?`
			`- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?`

			`## References`

			- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
			- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
			- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`