RAG Streaming Support Technical Specification

Overview

This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.

Goals

  • Consistent streaming UX: Provide the same streaming experience across all TrustGraph services
  • Minimal API changes: Add streaming support with a single streaming flag, following established patterns
  • Backward compatibility: Maintain existing non-streaming behavior as default
  • Reuse existing infrastructure: Leverage the PromptClient streaming support already implemented
  • Gateway support: Enable streaming through the websocket gateway for client applications

Background

Currently implemented streaming services:

  • LLM text-completion service: Phase 1 - streaming from LLM providers
  • Prompt service: Phase 2 - streaming through prompt templates
  • Agent service: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks

Current limitations for RAG services:

  • GraphRAG and DocumentRAG only support blocking responses
  • Users must wait for the complete LLM response before seeing any output
  • Poor UX for long responses from knowledge graph or document queries
  • Inconsistent experience compared to other TrustGraph services

This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:

  • Provide consistent streaming UX across all query types
  • Reduce perceived latency for RAG queries
  • Enable better progress feedback for long-running queries
  • Support real-time display in client applications

Technical Design

Architecture

The RAG streaming implementation leverages existing infrastructure:

  1. PromptClient Streaming (Already implemented)

    • kg_prompt() and document_prompt() already accept streaming and chunk_callback parameters
    • These call prompt() internally with streaming support (a caller-side sketch follows this list)
    • No changes needed to PromptClient

    Module: trustgraph-base/trustgraph/base/prompt_client.py

  2. GraphRAG Service (Needs streaming parameter pass-through)

    • Add streaming parameter to query() method
    • Pass streaming flag and callbacks to prompt_client.kg_prompt()
    • GraphRagRequest schema needs streaming field

    Modules:

    • trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py
    • trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py (Processor)
    • trustgraph-base/trustgraph/schema/graph_rag.py (Request schema)
    • trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py (Gateway)
  3. DocumentRAG Service (Needs streaming parameter pass-through)

    • Add streaming parameter to query() method
    • Pass streaming flag and callbacks to prompt_client.document_prompt()
    • DocumentRagRequest schema needs streaming field

    Modules:

    • trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py
    • trustgraph-flow/trustgraph/retrieval/document_rag/rag.py (Processor)
    • trustgraph-base/trustgraph/schema/document_rag.py (Request schema)
    • trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py (Gateway)
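
For reference, a caller-side sketch of the streaming PromptClient interface described in item 1. The print-based callback and the driver function are illustrative; only the kg_prompt(streaming=..., chunk_callback=...) signature comes from the existing interface:

async def print_chunk(chunk):
    # Invoked once per LLM output fragment as it arrives
    print(chunk, end="", flush=True)

async def ask(prompt_client, query, kg):
    # With streaming=True, chunks are delivered through the callback;
    # the assembled response is still returned at the end
    return await prompt_client.kg_prompt(
        query, kg,
        streaming=True,
        chunk_callback=print_chunk,
    )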

Data Flow

Non-streaming (current):

Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                   ↓
                                Prompt Service → LLM
                                   ↓
                                Complete response
                                   ↓
Client ← Gateway ← RAG Service ← Response

Streaming (proposed):

Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                   ↓
                                Prompt Service → LLM (streaming)
                                   ↓
                                Chunk → callback → RAG Response (chunk)
                                   ↓
Client ← Gateway ← RAG Service ← Response stream (chunks, then final with end_of_stream=True)

APIs

GraphRAG Changes:

  1. GraphRag.query() - Add streaming parameters
async def query(
    self, query, user, collection,
    verbose=False, streaming=False, chunk_callback=None  # NEW
):
    # ... existing entity/triple retrieval ...

    if streaming and chunk_callback:
        resp = await self.prompt_client.kg_prompt(
            query, kg,
            streaming=True,
            chunk_callback=chunk_callback
        )
    else:
        resp = await self.prompt_client.kg_prompt(query, kg)

    return resp

  2. GraphRagRequest schema - Add streaming field
class GraphRagRequest(Record):
    query = String()
    user = String()
    collection = String()
    streaming = Boolean()  # NEW

  3. GraphRagResponse schema - Add streaming fields (follow Agent pattern)
class GraphRagResponse(Record):
    response = String()    # Legacy: complete response
    chunk = String()       # NEW: streaming chunk
    end_of_stream = Boolean()  # NEW: indicates last chunk

  4. Processor - Pass streaming through
async def handle(self, msg):
    # ... existing code (request is extracted from msg) ...

    async def send_chunk(chunk):
        await self.respond(GraphRagResponse(
            chunk=chunk,
            end_of_stream=False,
            response=None
        ))

    if request.streaming:
        full_response = await self.rag.query(
            query=request.query,
            user=request.user,
            collection=request.collection,
            streaming=True,
            chunk_callback=send_chunk
        )
        # Send final message
        await self.respond(GraphRagResponse(
            chunk=None,
            end_of_stream=True,
            response=full_response
        ))
    else:
        # Existing non-streaming path
        response = await self.rag.query(...)
        await self.respond(GraphRagResponse(response=response))
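
On the consuming side, a recipient-style loop can reassemble the stream. This sketch borrows the rag_client.request(request, recipient=...) convention from the Gateway Changes section below; the helper itself is illustrative:

async def collect_stream(rag_client, request):
    # Accumulate chunks until the service signals end_of_stream
    parts = []

    async def recipient(resp):
        if resp.chunk:
            parts.append(resp.chunk)
        return resp.end_of_stream  # returning True terminates the exchange

    await rag_client.request(request, recipient=recipient)
    return "".join(parts)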

DocumentRAG Changes:

Identical pattern to GraphRAG:

  1. Add streaming and chunk_callback parameters to DocumentRag.query() (sketched after this list)
  2. Add streaming field to DocumentRagRequest
  3. Add chunk and end_of_stream fields to DocumentRagResponse
  4. Update Processor to handle streaming with callbacks
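
For concreteness, a sketch of the DocumentRag.query() change mirroring the GraphRag code above (docs stands in for whatever the existing document retrieval step produces):

async def query(
    self, query, user, collection,
    streaming=False, chunk_callback=None  # NEW
):
    # ... existing document retrieval produces docs ...

    if streaming and chunk_callback:
        return await self.prompt_client.document_prompt(
            query, docs,
            streaming=True,
            chunk_callback=chunk_callback
        )

    return await self.prompt_client.document_prompt(query, docs)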

Gateway Changes:

Both graph_rag.py and document_rag.py in gateway/dispatch need updates to forward streaming chunks to the websocket:

async def handle(self, message, session, websocket):
    # ... existing code (request is built from the incoming message) ...

    if request.streaming:
        async def recipient(resp):
            if resp.chunk:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk},
                    "complete": resp.end_of_stream
                }))
            return resp.end_of_stream

        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
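
From the client's side, the websocket then carries one JSON frame per chunk, followed by a final frame with complete set. A minimal consumer sketch using the websockets package; the request envelope (beyond id and the streaming flag) is illustrative, while the response shape follows the dispatch code above:

import asyncio
import json

import websockets  # any asyncio websocket client works equally well

async def stream_query(uri, request_id, query):
    async with websockets.connect(uri) as ws:
        # Envelope fields other than "id" are illustrative
        await ws.send(json.dumps({
            "id": request_id,
            "request": {"query": query, "streaming": True},
        }))
        while True:
            frame = json.loads(await ws.recv())
            chunk = frame.get("response", {}).get("chunk")
            if chunk:
                print(chunk, end="", flush=True)
            if frame.get("complete"):
                break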

Implementation Details

Implementation order:

  1. Add schema fields (Request + Response for both RAG services)
  2. Update GraphRag.query() and DocumentRag.query() methods
  3. Update Processors to handle streaming
  4. Update Gateway dispatch handlers
  5. Add --no-streaming flags to tg-invoke-graph-rag and tg-invoke-document-rag (streaming enabled by default, following the agent CLI pattern; flag wiring sketched below)
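
A minimal sketch of the flag wiring, assuming the CLIs use argparse (the program name and other arguments are illustrative):

import argparse

parser = argparse.ArgumentParser(prog="tg-invoke-graph-rag")
parser.add_argument("-q", "--query", required=True)
# Streaming on by default; --no-streaming selects the blocking path
parser.add_argument(
    "--no-streaming", dest="streaming",
    action="store_false", default=True,
)
args = parser.parse_args()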

Callback pattern: Follow the same async callback pattern established in Agent streaming:

  • Processor defines async def send_chunk(chunk) callback
  • Passes callback to RAG service
  • RAG service passes callback to PromptClient
  • PromptClient invokes callback for each LLM chunk
  • Processor sends streaming response message for each chunk
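
A toy, runnable illustration of this chain; every name here is a stand-in rather than a real TrustGraph class:

import asyncio

async def fake_prompt_client(chunk_callback):
    # Stands in for PromptClient: invokes the callback once per chunk
    for token in ["Graphs ", "store ", "triples."]:
        await chunk_callback(token)
    return "Graphs store triples."

async def fake_rag_query(chunk_callback):
    # Stands in for GraphRag.query(): forwards the callback unchanged
    return await fake_prompt_client(chunk_callback)

async def fake_processor():
    sent = []

    async def send_chunk(chunk):
        # Stands in for respond(GraphRagResponse(chunk=..., end_of_stream=False))
        sent.append(chunk)

    full = await fake_rag_query(send_chunk)
    assert "".join(sent) == full  # chunks reassemble into the final response

asyncio.run(fake_processor())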

Error handling:

  • Errors during streaming should send an error response with end_of_stream=True (sketched below)
  • Follow existing error propagation patterns from Agent streaming
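
A sketch of the streaming error path in the Processor; the error field is an assumption standing in for whatever error convention the existing GraphRagResponse schema uses:

try:
    full_response = await self.rag.query(
        query=request.query,
        user=request.user,
        collection=request.collection,
        streaming=True,
        chunk_callback=send_chunk,
    )
except Exception as e:
    # Close the stream so clients do not wait for further chunks
    await self.respond(GraphRagResponse(
        error=str(e),          # assumption: error field per existing schema
        chunk=None,
        end_of_stream=True,
        response=None,
    ))
    return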

Security Considerations

No new security considerations beyond existing RAG services:

  • Streaming responses use same user/collection isolation
  • No changes to authentication or authorization
  • Chunk boundaries don't expose sensitive data

Performance Considerations

Benefits:

  • Reduced perceived latency (first tokens arrive faster)
  • Better UX for long responses
  • Lower memory usage (no need to buffer complete response)

Potential concerns:

  • More Pulsar messages for streaming responses
  • Slightly higher CPU for chunking/callback overhead
  • Mitigated by streaming being opt-in; the default remains non-streaming

Testing considerations:

  • Test with large knowledge graphs (many triples)
  • Test with many retrieved documents
  • Measure overhead of streaming vs non-streaming

Testing Strategy

Unit tests:

  • Test GraphRag.query() with streaming=True/False
  • Test DocumentRag.query() with streaming=True/False
  • Mock PromptClient to verify callback invocations
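
A sketch of the mock-PromptClient approach (pytest-asyncio assumed; FakePromptClient and its canned chunks are illustrative):

import pytest

class FakePromptClient:
    # Minimal stand-in that honours the streaming contract
    async def kg_prompt(self, query, kg, streaming=False, chunk_callback=None):
        if streaming and chunk_callback:
            for c in ["a", "b", "c"]:
                await chunk_callback(c)
        return "abc"

@pytest.mark.asyncio
async def test_streaming_invokes_callback():
    received = []

    async def cb(chunk):
        received.append(chunk)

    client = FakePromptClient()
    resp = await client.kg_prompt("q", kg={}, streaming=True, chunk_callback=cb)

    assert received == ["a", "b", "c"]
    assert resp == "abc"  # full response still returned after streaming

In the real tests, FakePromptClient would be injected in place of the Pulsar-backed client so GraphRag.query() and DocumentRag.query() can be exercised end to end.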

Integration tests:

  • Test full GraphRAG streaming flow (similar to existing agent streaming tests)
  • Test full DocumentRAG streaming flow
  • Test Gateway streaming forwarding
  • Test CLI streaming output

Manual testing:

  • tg-invoke-graph-rag -q "What is machine learning?" (streaming by default)
  • tg-invoke-document-rag -q "Summarize the documents about AI" (streaming by default)
  • tg-invoke-graph-rag --no-streaming -q "..." (test non-streaming mode)
  • Verify incremental output appears in streaming mode

Migration Plan

No migration needed:

  • Streaming is opt-in via the streaming parameter, which defaults to False at the API level (the CLIs enable it by default)
  • Existing clients continue to work unchanged
  • New clients can opt into streaming

Timeline

Estimated implementation: 4-6 hours

  • Phase 1 (2 hours): GraphRAG streaming support
  • Phase 2 (2 hours): DocumentRAG streaming support
  • Phase 3 (1-2 hours): Gateway updates and CLI flags
  • Testing: Built into each phase

Open Questions

  • Should we add streaming support to NLP Query service as well?
  • Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
  • Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?

References

  • Existing implementation: docs/tech-specs/streaming-llm-responses.md
  • Agent streaming: trustgraph-flow/trustgraph/agent/react/agent_manager.py
  • PromptClient streaming: trustgraph-base/trustgraph/base/prompt_client.py