mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 16:36:21 +02:00
Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.
294 lines
10 KiB
Markdown
294 lines
10 KiB
Markdown
---
|
|
layout: default
|
|
title: "RAG Streaming Support Technical Specification"
|
|
parent: "Tech Specs"
|
|
---
|
|
|
|
# RAG Streaming Support Technical Specification
|
|
|
|
## Overview
|
|
|
|
This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.
|
|
|
|
## Goals
|
|
|
|
- **Consistent streaming UX**: Provide the same streaming experience across all TrustGraph services
|
|
- **Minimal API changes**: Add streaming support with a single `streaming` flag, following established patterns
|
|
- **Backward compatibility**: Maintain existing non-streaming behavior as default
|
|
- **Reuse existing infrastructure**: Leverage PromptClient streaming already implemented
|
|
- **Gateway support**: Enable streaming through websocket gateway for client applications
|
|
|
|
## Background
|
|
|
|
Currently implemented streaming services:
|
|
- **LLM text-completion service**: Phase 1 - streaming from LLM providers
|
|
- **Prompt service**: Phase 2 - streaming through prompt templates
|
|
- **Agent service**: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks
|
|
|
|
Current limitations for RAG services:
|
|
- GraphRAG and DocumentRAG only support blocking responses
|
|
- Users must wait for complete LLM response before seeing any output
|
|
- Poor UX for long responses from knowledge graph or document queries
|
|
- Inconsistent experience compared to other TrustGraph services
|
|
|
|
This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:
|
|
- Provide consistent streaming UX across all query types
|
|
- Reduce perceived latency for RAG queries
|
|
- Enable better progress feedback for long-running queries
|
|
- Support real-time display in client applications
|
|
|
|
## Technical Design
|
|
|
|
### Architecture
|
|
|
|
The RAG streaming implementation leverages existing infrastructure:
|
|
|
|
1. **PromptClient Streaming** (Already implemented)
|
|
- `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
|
|
- These call `prompt()` internally with streaming support
|
|
- No changes needed to PromptClient
|
|
|
|
Module: `trustgraph-base/trustgraph/base/prompt_client.py`
|
|
|
|
2. **GraphRAG Service** (Needs streaming parameter pass-through)
|
|
- Add `streaming` parameter to `query()` method
|
|
- Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
|
|
- GraphRagRequest schema needs `streaming` field
|
|
|
|
Modules:
|
|
- `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
|
|
- `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
|
|
- `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
|
|
- `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)
|
|
|
|
3. **DocumentRAG Service** (Needs streaming parameter pass-through)
|
|
- Add `streaming` parameter to `query()` method
|
|
- Pass streaming flag and callbacks to `prompt_client.document_prompt()`
|
|
- DocumentRagRequest schema needs `streaming` field
|
|
|
|
Modules:
|
|
- `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
|
|
- `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
|
|
- `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
|
|
- `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)
|
|
|
|
### Data Flow
|
|
|
|
**Non-streaming (current)**:
|
|
```
|
|
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
|
|
↓
|
|
Prompt Service → LLM
|
|
↓
|
|
Complete response
|
|
↓
|
|
Client ← Gateway ← RAG Service ← Response
|
|
```
|
|
|
|
**Streaming (proposed)**:
|
|
```
|
|
Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
|
|
↓
|
|
Prompt Service → LLM (streaming)
|
|
↓
|
|
Chunk → callback → RAG Response (chunk)
|
|
↓ ↓
|
|
Client ← Gateway ← ────────────────────────────────── Response stream
|
|
```
|
|
|
|
### APIs
|
|
|
|
**GraphRAG Changes**:
|
|
|
|
1. **GraphRag.query()** - Add streaming parameters
|
|
```python
|
|
async def query(
|
|
self, query, user, collection,
|
|
verbose=False, streaming=False, chunk_callback=None # NEW
|
|
):
|
|
# ... existing entity/triple retrieval ...
|
|
|
|
if streaming and chunk_callback:
|
|
resp = await self.prompt_client.kg_prompt(
|
|
query, kg,
|
|
streaming=True,
|
|
chunk_callback=chunk_callback
|
|
)
|
|
else:
|
|
resp = await self.prompt_client.kg_prompt(query, kg)
|
|
|
|
return resp
|
|
```
|
|
|
|
2. **GraphRagRequest schema** - Add streaming field
|
|
```python
|
|
class GraphRagRequest(Record):
|
|
query = String()
|
|
user = String()
|
|
collection = String()
|
|
streaming = Boolean() # NEW
|
|
```
|
|
|
|
3. **GraphRagResponse schema** - Add streaming fields (follow Agent pattern)
|
|
```python
|
|
class GraphRagResponse(Record):
|
|
response = String() # Legacy: complete response
|
|
chunk = String() # NEW: streaming chunk
|
|
end_of_stream = Boolean() # NEW: indicates last chunk
|
|
```
|
|
|
|
4. **Processor** - Pass streaming through
|
|
```python
|
|
async def handle(self, msg):
|
|
# ... existing code ...
|
|
|
|
async def send_chunk(chunk):
|
|
await self.respond(GraphRagResponse(
|
|
chunk=chunk,
|
|
end_of_stream=False,
|
|
response=None
|
|
))
|
|
|
|
if request.streaming:
|
|
full_response = await self.rag.query(
|
|
query=request.query,
|
|
user=request.user,
|
|
collection=request.collection,
|
|
streaming=True,
|
|
chunk_callback=send_chunk
|
|
)
|
|
# Send final message
|
|
await self.respond(GraphRagResponse(
|
|
chunk=None,
|
|
end_of_stream=True,
|
|
response=full_response
|
|
))
|
|
else:
|
|
# Existing non-streaming path
|
|
response = await self.rag.query(...)
|
|
await self.respond(GraphRagResponse(response=response))
|
|
```
|
|
|
|
**DocumentRAG Changes**:
|
|
|
|
Identical pattern to GraphRAG:
|
|
1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
|
|
2. Add `streaming` field to `DocumentRagRequest`
|
|
3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
|
|
4. Update Processor to handle streaming with callbacks
|
|
|
|
**Gateway Changes**:
|
|
|
|
Both `graph_rag.py` and `document_rag.py` in gateway/dispatch need updates to forward streaming chunks to websocket:
|
|
|
|
```python
|
|
async def handle(self, message, session, websocket):
|
|
# ... existing code ...
|
|
|
|
if request.streaming:
|
|
async def recipient(resp):
|
|
if resp.chunk:
|
|
await websocket.send(json.dumps({
|
|
"id": message["id"],
|
|
"response": {"chunk": resp.chunk},
|
|
"complete": resp.end_of_stream
|
|
}))
|
|
return resp.end_of_stream
|
|
|
|
await self.rag_client.request(request, recipient=recipient)
|
|
else:
|
|
# Existing non-streaming path
|
|
resp = await self.rag_client.request(request)
|
|
await websocket.send(...)
|
|
```
|
|
|
|
### Implementation Details
|
|
|
|
**Implementation order**:
|
|
1. Add schema fields (Request + Response for both RAG services)
|
|
2. Update GraphRag.query() and DocumentRag.query() methods
|
|
3. Update Processors to handle streaming
|
|
4. Update Gateway dispatch handlers
|
|
5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following agent CLI pattern)
|
|
|
|
**Callback pattern**:
|
|
Follow the same async callback pattern established in Agent streaming:
|
|
- Processor defines `async def send_chunk(chunk)` callback
|
|
- Passes callback to RAG service
|
|
- RAG service passes callback to PromptClient
|
|
- PromptClient invokes callback for each LLM chunk
|
|
- Processor sends streaming response message for each chunk
|
|
|
|
**Error handling**:
|
|
- Errors during streaming should send error response with `end_of_stream=True`
|
|
- Follow existing error propagation patterns from Agent streaming
|
|
|
|
## Security Considerations
|
|
|
|
No new security considerations beyond existing RAG services:
|
|
- Streaming responses use same user/collection isolation
|
|
- No changes to authentication or authorization
|
|
- Chunk boundaries don't expose sensitive data
|
|
|
|
## Performance Considerations
|
|
|
|
**Benefits**:
|
|
- Reduced perceived latency (first tokens arrive faster)
|
|
- Better UX for long responses
|
|
- Lower memory usage (no need to buffer complete response)
|
|
|
|
**Potential concerns**:
|
|
- More Pulsar messages for streaming responses
|
|
- Slightly higher CPU for chunking/callback overhead
|
|
- Mitigated by: streaming is opt-in, default remains non-streaming
|
|
|
|
**Testing considerations**:
|
|
- Test with large knowledge graphs (many triples)
|
|
- Test with many retrieved documents
|
|
- Measure overhead of streaming vs non-streaming
|
|
|
|
## Testing Strategy
|
|
|
|
**Unit tests**:
|
|
- Test GraphRag.query() with streaming=True/False
|
|
- Test DocumentRag.query() with streaming=True/False
|
|
- Mock PromptClient to verify callback invocations
|
|
|
|
**Integration tests**:
|
|
- Test full GraphRAG streaming flow (similar to existing agent streaming tests)
|
|
- Test full DocumentRAG streaming flow
|
|
- Test Gateway streaming forwarding
|
|
- Test CLI streaming output
|
|
|
|
**Manual testing**:
|
|
- `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
|
|
- `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
|
|
- `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
|
|
- Verify incremental output appears in streaming mode
|
|
|
|
## Migration Plan
|
|
|
|
No migration needed:
|
|
- Streaming is opt-in via `streaming` parameter (defaults to False)
|
|
- Existing clients continue to work unchanged
|
|
- New clients can opt into streaming
|
|
|
|
## Timeline
|
|
|
|
Estimated implementation: 4-6 hours
|
|
- Phase 1 (2 hours): GraphRAG streaming support
|
|
- Phase 2 (2 hours): DocumentRAG streaming support
|
|
- Phase 3 (1-2 hours): Gateway updates and CLI flags
|
|
- Testing: Built into each phase
|
|
|
|
## Open Questions
|
|
|
|
- Should we add streaming support to NLP Query service as well?
|
|
- Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
|
|
- Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?
|
|
|
|
## References
|
|
|
|
- Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
|
|
- Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
|
|
- PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`
|