trustgraph/docs/tech-specs/rag-streaming-support.md

layout: default
title: RAG Streaming Support Technical Specification
parent: Tech Specs

RAG Streaming Support Technical Specification

Overview

This specification describes adding streaming support to the GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. It extends the streaming architecture already implemented for the LLM text-completion, prompt, and agent services.

Goals

  • Consistent streaming UX: Provide the same streaming experience across all TrustGraph services
  • Minimal API changes: Add streaming support with a single streaming flag, following established patterns
  • Backward compatibility: Maintain existing non-streaming behavior as default
  • Reuse existing infrastructure: Leverage the PromptClient streaming that is already implemented
  • Gateway support: Enable streaming through the websocket gateway for client applications

Background

Currently implemented streaming services:

  • LLM text-completion service: Phase 1 - streaming from LLM providers
  • Prompt service: Phase 2 - streaming through prompt templates
  • Agent service: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks

Current limitations for RAG services:

  • GraphRAG and DocumentRAG only support blocking responses
  • Users must wait for the complete LLM response before seeing any output
  • Poor UX for long responses from knowledge graph or document queries
  • Inconsistent experience compared to other TrustGraph services

This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:

  • Provide consistent streaming UX across all query types
  • Reduce perceived latency for RAG queries
  • Enable better progress feedback for long-running queries
  • Support real-time display in client applications

Technical Design

Architecture

The RAG streaming implementation leverages existing infrastructure:

  1. PromptClient Streaming (Already implemented)

    • kg_prompt() and document_prompt() already accept streaming and chunk_callback parameters
    • These call prompt() internally with streaming support
    • No changes needed to PromptClient

    Module: trustgraph-base/trustgraph/base/prompt_client.py

  2. GraphRAG Service (Needs streaming parameter pass-through)

    • Add streaming parameter to query() method
    • Pass streaming flag and callbacks to prompt_client.kg_prompt()
    • GraphRagRequest schema needs streaming field

    Modules:

    • trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py
    • trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py (Processor)
    • trustgraph-base/trustgraph/schema/graph_rag.py (Request schema)
    • trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py (Gateway)
  3. DocumentRAG Service (Needs streaming parameter pass-through)

    • Add streaming parameter to query() method
    • Pass streaming flag and callbacks to prompt_client.document_prompt()
    • DocumentRagRequest schema needs streaming field

    Modules:

    • trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py
    • trustgraph-flow/trustgraph/retrieval/document_rag/rag.py (Processor)
    • trustgraph-base/trustgraph/schema/document_rag.py (Request schema)
    • trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py (Gateway)

Data Flow

Non-streaming (current):

Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                   ↓
                                Prompt Service → LLM
                                   ↓
                                Complete response
                                   ↓
Client ← Gateway ← RAG Service ←  Response

Streaming (proposed):

Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                   ↓
                                Prompt Service → LLM (streaming)
                                   ↓
                                Chunk → callback → RAG Response (chunk)
                                   ↓                       ↓
Client ← Gateway ← ────────────────────────────────── Response stream

APIs

GraphRAG Changes:

  1. GraphRag.query() - Add streaming parameters
async def query(
    self, query, user, collection,
    verbose=False, streaming=False, chunk_callback=None  # NEW
):
    # ... existing entity/triple retrieval ...

    if streaming and chunk_callback:
        resp = await self.prompt_client.kg_prompt(
            query, kg,
            streaming=True,
            chunk_callback=chunk_callback
        )
    else:
        resp = await self.prompt_client.kg_prompt(query, kg)

    return resp
  2. GraphRagRequest schema - Add streaming field
class GraphRagRequest(Record):
    query = String()
    user = String()
    collection = String()
    streaming = Boolean()  # NEW
  3. GraphRagResponse schema - Add streaming fields (follow Agent pattern)
class GraphRagResponse(Record):
    response = String()    # Legacy: complete response
    chunk = String()       # NEW: streaming chunk
    end_of_stream = Boolean()  # NEW: indicates last chunk
  4. Processor - Pass streaming through
async def handle(self, msg):
    # ... existing code ...

    async def send_chunk(chunk):
        await self.respond(GraphRagResponse(
            chunk=chunk,
            end_of_stream=False,
            response=None
        ))

    if request.streaming:
        full_response = await self.rag.query(
            query=request.query,
            user=request.user,
            collection=request.collection,
            streaming=True,
            chunk_callback=send_chunk
        )
        # Send final message
        await self.respond(GraphRagResponse(
            chunk=None,
            end_of_stream=True,
            response=full_response
        ))
    else:
        # Existing non-streaming path
        response = await self.rag.query(...)
        await self.respond(GraphRagResponse(response=response))
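
To make the resulting message sequence concrete, here is a minimal, self-contained sketch of the Processor's streaming path. `Response`, `FakeRag`, `handle_streaming`, and `collect` are hypothetical stand-ins (a dataclass replaces the Pulsar `Record` schema, and the fake RAG object emits fixed chunks); only the field names `chunk`, `end_of_stream`, and `response` come from this spec.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class Response:
    """Stand-in for GraphRagResponse (assumption: a dataclass, not a Pulsar Record)."""
    response: Optional[str] = None
    chunk: Optional[str] = None
    end_of_stream: bool = False


class FakeRag:
    """Hypothetical RAG object that 'streams' a fixed list of chunks."""

    def __init__(self, chunks: List[str]):
        self.chunks = chunks

    async def query(self, query, user, collection,
                    streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            for c in self.chunks:
                await chunk_callback(c)
        return "".join(self.chunks)


async def handle_streaming(rag, respond: Callable[[Response], Awaitable[None]],
                           query: str, user: str, collection: str) -> None:
    """Mirror of the streaming branch: one message per chunk, then a final one."""

    async def send_chunk(chunk: str) -> None:
        await respond(Response(chunk=chunk, end_of_stream=False))

    full = await rag.query(query, user, collection,
                           streaming=True, chunk_callback=send_chunk)
    await respond(Response(response=full, end_of_stream=True))


async def collect(chunks: List[str]) -> List[Response]:
    """Run the handler and return every message it sent, in order."""
    sent: List[Response] = []

    async def respond(msg: Response) -> None:
        sent.append(msg)

    await handle_streaming(FakeRag(chunks), respond, "q", "user", "default")
    return sent
```

Under these assumptions, N chunks from the LLM produce N + 1 response messages: N chunk messages with `end_of_stream=False`, followed by one final message that carries the full response.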

DocumentRAG Changes:

Identical pattern to GraphRAG:

  1. Add streaming and chunk_callback parameters to DocumentRag.query()
  2. Add streaming field to DocumentRagRequest
  3. Add chunk and end_of_stream fields to DocumentRagResponse
  4. Update Processor to handle streaming with callbacks
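
As a rough illustration of step 1, the sketch below shows what the pass-through in `DocumentRag.query()` might look like. It is not the actual implementation: `StubPromptClient`, `retrieve_documents`, and `demo_document_rag` are hypothetical, and the `document_prompt(query, documents, ...)` argument order is assumed by analogy with the `kg_prompt` call shown for GraphRAG.

```python
import asyncio


class StubPromptClient:
    """Hypothetical stand-in for PromptClient; replays a canned answer in chunks."""

    async def document_prompt(self, query, documents,
                              streaming=False, chunk_callback=None):
        answer = f"{len(documents)} documents match: {query}"
        if streaming and chunk_callback:
            for word in answer.split(" "):
                await chunk_callback(word + " ")
        return answer


class DocumentRag:
    """Sketch only: the real class has more state and a real retrieval step."""

    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def retrieve_documents(self, query, user, collection):
        # Placeholder for the existing document retrieval (hypothetical name).
        return ["doc-a", "doc-b"]

    async def query(self, query, user, collection,
                    verbose=False, streaming=False, chunk_callback=None):
        docs = await self.retrieve_documents(query, user, collection)

        if streaming and chunk_callback:
            return await self.prompt_client.document_prompt(
                query, docs, streaming=True, chunk_callback=chunk_callback
            )
        # Non-streaming path is unchanged.
        return await self.prompt_client.document_prompt(query, docs)


async def demo_document_rag():
    chunks = []

    async def on_chunk(chunk):
        chunks.append(chunk)

    rag = DocumentRag(StubPromptClient())
    streamed = await rag.query("AI", "user", "default",
                               streaming=True, chunk_callback=on_chunk)
    blocking = await rag.query("AI", "user", "default")
    return chunks, streamed, blocking
```

Both calls return the same complete string; the streaming call additionally delivers it piecewise through the callback.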

Gateway Changes:

Both graph_rag.py and document_rag.py in gateway/dispatch need updates to forward streaming chunks to the websocket:

async def handle(self, message, session, websocket):
    # ... existing code ...

    if request.streaming:
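        async def recipient(resp):
            # Forward every chunk, and also forward the final message: it has
            # no chunk, but the client needs it to see "complete": true.
            if resp.chunk or resp.end_of_stream:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk or ""},
                    "complete": bool(resp.end_of_stream)
                }))
            # Returning True tells the client the stream is finished.
            return resp.end_of_stream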

        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
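
On the client side, the websocket messages have to be stitched back together. The following sketch assumes the message shape used in the handler above (`id`, `response.chunk`, `complete`); the function name `assemble_stream` is hypothetical, and it operates on a list of raw JSON strings rather than a live websocket so that it stays self-contained.

```python
import json
from typing import Iterable, Tuple


def assemble_stream(raw_messages: Iterable[str], request_id: str) -> Tuple[str, bool]:
    """Concatenate chunks for one request id; return (text, saw_complete)."""
    parts = []
    complete = False
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("id") != request_id:
            continue  # Message belongs to another in-flight request.
        chunk = (msg.get("response") or {}).get("chunk")
        if chunk:
            parts.append(chunk)
        if msg.get("complete"):
            complete = True
            break
    return "".join(parts), complete
```

Filtering on `id` matters because several requests may share one websocket connection; the second return value lets a caller tell a finished stream from a truncated one.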

Implementation Details

Implementation order:

  1. Add schema fields (Request + Response for both RAG services)
  2. Update GraphRag.query() and DocumentRag.query() methods
  3. Update Processors to handle streaming
  4. Update Gateway dispatch handlers
  5. Add --no-streaming flags to tg-invoke-graph-rag and tg-invoke-document-rag. The CLI tools stream by default, following the agent CLI pattern, even though the request-level streaming field defaults to False.
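
For step 5, one way to get "streaming unless disabled" behaviour is argparse's `store_false` action. This is a sketch, not the actual CLI source: `build_parser` is a hypothetical helper, and only the `-q` and `--no-streaming` flag names are taken from this spec.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose 'streaming' attribute defaults to True."""
    parser = argparse.ArgumentParser(prog="tg-invoke-graph-rag")
    parser.add_argument("-q", dest="query", required=True, help="Query text")
    parser.add_argument(
        "--no-streaming",
        dest="streaming",
        action="store_false",  # store_false implies a default of True
        help="Wait for the complete response instead of streaming chunks",
    )
    return parser
```

The parsed `streaming` value can then be copied straight into the request's `streaming` field.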

Callback pattern: Follow the same async callback pattern established in Agent streaming:

  • Processor defines async def send_chunk(chunk) callback
  • Passes callback to RAG service
  • RAG service passes callback to PromptClient
  • PromptClient invokes callback for each LLM chunk
  • Processor sends streaming response message for each chunk
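
The chain above can be traced with three nested stand-in functions. Everything in this sketch is hypothetical scaffolding (`run_chain`, `rag_query`, and the local `kg_prompt` are not the real classes), and it assumes each layer awaits the callback, as the code in this spec does. Under that assumption, the next chunk is not processed until the previous one has been sent.

```python
import asyncio
from typing import List


async def run_chain(llm_chunks: List[str]) -> List[str]:
    """Trace the order in which each layer touches each chunk."""
    trace: List[str] = []

    # Stand-in for PromptClient: invokes the callback once per LLM chunk.
    async def kg_prompt(query, kg, streaming=False, chunk_callback=None):
        for chunk in llm_chunks:
            trace.append(f"prompt:{chunk}")
            if streaming and chunk_callback:
                await chunk_callback(chunk)
        return "".join(llm_chunks)

    # Stand-in for the RAG service: passes the callback through untouched.
    async def rag_query(query, streaming=False, chunk_callback=None):
        return await kg_prompt(query, kg=[], streaming=streaming,
                               chunk_callback=chunk_callback)

    # Stand-in for the Processor, which owns the callback.
    async def send_chunk(chunk):
        await asyncio.sleep(0)  # Simulate an awaitable send to the message bus.
        trace.append(f"sent:{chunk}")

    await rag_query("q", streaming=True, chunk_callback=send_chunk)
    return trace
```

Because the callback is awaited, "prompt" and "sent" events strictly alternate in the trace, so a slow send naturally slows chunk consumption.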

Error handling:

  • Errors during streaming should send an error response with end_of_stream=True, so that clients stop waiting for further chunks
  • Follow existing error propagation patterns from Agent streaming
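
A possible shape for this error path is sketched below. The `error` field, the `Response` dataclass, and the helper names are assumptions for illustration; the spec only requires that the error message terminates the stream with `end_of_stream=True`.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    chunk: Optional[str] = None
    response: Optional[str] = None
    error: Optional[str] = None   # Hypothetical field; the spec does not name it.
    end_of_stream: bool = False


async def stream_with_error_handling(query_fn, respond) -> None:
    """Forward chunks; on failure, close the stream with an error message."""

    async def send_chunk(chunk: str) -> None:
        await respond(Response(chunk=chunk))

    try:
        full = await query_fn(send_chunk)
    except Exception as exc:
        # Chunks already sent cannot be recalled, so mark the stream as ended.
        await respond(Response(error=str(exc), end_of_stream=True))
        return
    await respond(Response(response=full, end_of_stream=True))


async def failing_query(chunk_callback):
    """Hypothetical query that fails after emitting one chunk."""
    await chunk_callback("partial ")
    raise RuntimeError("LLM connection lost")


async def demo_error() -> List[Response]:
    sent: List[Response] = []

    async def respond(msg: Response) -> None:
        sent.append(msg)

    await stream_with_error_handling(failing_query, respond)
    return sent
```

Clients should therefore be prepared to receive some chunks followed by an error, and treat the partial text as incomplete.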

Security Considerations

No new security considerations beyond existing RAG services:

  • Streaming responses use same user/collection isolation
  • No changes to authentication or authorization
  • Chunking does not change what is returned: chunks carry only text that would also appear in the non-streaming response

Performance Considerations

Benefits:

  • Reduced perceived latency (first tokens arrive faster)
  • Better UX for long responses
  • Less client-side buffering, since output can be displayed as it arrives (note that the final message still carries the complete response)

Potential concerns:

  • More Pulsar messages for streaming responses
  • Slightly higher CPU for chunking/callback overhead
  • Mitigated by: streaming is opt-in at the request level, where the default remains non-streaming

Testing considerations:

  • Test with large knowledge graphs (many triples)
  • Test with many retrieved documents
  • Measure overhead of streaming vs non-streaming
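
One simple way to quantify the difference is to record time to first chunk alongside total time. The harness below is a sketch under stated assumptions: `measure_stream` and `simulated_query` are hypothetical, and the simulated query would be replaced by a real client call when measuring an actual deployment.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict


async def measure_stream(query_fn: Callable[..., Awaitable[str]]) -> Dict[str, Any]:
    """Return time to first chunk, total time, and chunk count for one query."""
    start = time.perf_counter()
    stats: Dict[str, Any] = {"first_chunk_s": None, "chunks": 0}

    async def on_chunk(chunk: str) -> None:
        if stats["first_chunk_s"] is None:
            stats["first_chunk_s"] = time.perf_counter() - start
        stats["chunks"] += 1

    await query_fn(chunk_callback=on_chunk)
    stats["total_s"] = time.perf_counter() - start
    return stats


async def simulated_query(chunk_callback=None) -> str:
    """Hypothetical query emitting 5 chunks with a small delay before each."""
    parts = []
    for i in range(5):
        await asyncio.sleep(0.01)
        parts.append(f"tok{i} ")
        if chunk_callback:
            await chunk_callback(parts[-1])
    return "".join(parts)
```

Comparing `first_chunk_s` against the total time of a non-streaming call gives the perceived-latency gain, while `chunks` approximates the number of extra messages sent.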

Testing Strategy

Unit tests:

  • Test GraphRag.query() with streaming=True/False
  • Test DocumentRag.query() with streaming=True/False
  • Mock PromptClient to verify callback invocations
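
A unit test along these lines could use `unittest.mock.AsyncMock` from the standard library. The `GraphRag` class here is a trimmed-down stand-in containing only the pass-through logic (the real class also performs retrieval and takes `user` and `collection`), and `make_prompt_client` is a hypothetical helper.

```python
import asyncio
from unittest.mock import AsyncMock


class GraphRag:
    """Stand-in containing only the pass-through logic under test."""

    def __init__(self, prompt_client):
        self.prompt_client = prompt_client

    async def query(self, query, kg, streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            return await self.prompt_client.kg_prompt(
                query, kg, streaming=True, chunk_callback=chunk_callback
            )
        return await self.prompt_client.kg_prompt(query, kg)


def make_prompt_client(chunks):
    """Build a mocked PromptClient whose kg_prompt replays the given chunks."""

    async def fake_kg_prompt(query, kg, streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            for c in chunks:
                await chunk_callback(c)
        return "".join(chunks)

    client = AsyncMock()
    client.kg_prompt.side_effect = fake_kg_prompt
    return client


def test_streaming_invokes_callback_per_chunk():
    client = make_prompt_client(["a", "b", "c"])
    callback = AsyncMock()
    result = asyncio.run(
        GraphRag(client).query("q", [], streaming=True, chunk_callback=callback)
    )
    assert result == "abc"
    # The callback should be awaited once per chunk, in order.
    assert [c.args[0] for c in callback.await_args_list] == ["a", "b", "c"]


def test_non_streaming_omits_streaming_arguments():
    client = make_prompt_client(["a", "b"])
    result = asyncio.run(GraphRag(client).query("q", []))
    assert result == "ab"
    # The legacy call shape must be preserved when streaming is off.
    client.kg_prompt.assert_awaited_once_with("q", [])
```

The second test guards backward compatibility: with `streaming=False`, the PromptClient should be called exactly as before.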

Integration tests:

  • Test full GraphRAG streaming flow (similar to existing agent streaming tests)
  • Test full DocumentRAG streaming flow
  • Test Gateway streaming forwarding
  • Test CLI streaming output

Manual testing:

  • tg-invoke-graph-rag -q "What is machine learning?" (streaming by default)
  • tg-invoke-document-rag -q "Summarize the documents about AI" (streaming by default)
  • tg-invoke-graph-rag --no-streaming -q "..." (test non-streaming mode)
  • Verify incremental output appears in streaming mode

Migration Plan

No migration needed:

  • Streaming is opt-in via streaming parameter (defaults to False)
  • Existing clients continue to work unchanged
  • New clients can opt into streaming

Timeline

Estimated implementation: 5-6 hours

  • Phase 1 (2 hours): GraphRAG streaming support
  • Phase 2 (2 hours): DocumentRAG streaming support
  • Phase 3 (1-2 hours): Gateway updates and CLI flags
  • Testing: Built into each phase

Open Questions

  • Should we add streaming support to NLP Query service as well?
  • Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
  • Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?

References

  • Existing implementation: docs/tech-specs/streaming-llm-responses.md
  • Agent streaming: trustgraph-flow/trustgraph/agent/react/agent_manager.py
  • PromptClient streaming: trustgraph-base/trustgraph/base/prompt_client.py