---
layout: default
title: Streaming LLM Responses Technical Specification
parent: Tech Specs
---

# Streaming LLM Responses Technical Specification
## Overview
This specification describes the implementation of streaming support for LLM responses in TrustGraph. Streaming enables real-time delivery of generated tokens as they are produced by the LLM, rather than waiting for complete response generation.
This implementation supports the following use cases:
- Real-time User Interfaces: Stream tokens to UI as they are generated, providing immediate visual feedback
- Reduced Time-to-First-Token: Users see output beginning immediately rather than waiting for full generation
- Long Response Handling: Handle very long outputs that might otherwise timeout or exceed memory limits
- Interactive Applications: Enable responsive chat and agent interfaces
## Goals
- Backward Compatibility: Existing non-streaming clients continue to work without modification
- Consistent API Design: Streaming and non-streaming use the same schema patterns with minimal divergence
- Provider Flexibility: Support streaming where available, graceful fallback where not
- Phased Rollout: Incremental implementation to reduce risk
- End-to-End Support: Streaming from LLM provider through to client applications via Pulsar, Gateway API, and Python API
## Background

### Current Architecture
The current LLM text completion flow operates as follows:
1. Client sends a `TextCompletionRequest` with `system` and `prompt` fields
2. The LLM service processes the request and waits for complete generation
3. A single `TextCompletionResponse` is returned with the complete `response` string
Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```
### Current Limitations
- Latency: Users must wait for complete generation before seeing any output
- Timeout Risk: Long generations may exceed client timeout thresholds
- Poor UX: No feedback during generation creates perception of slowness
- Resource Usage: Full responses must be buffered in memory
This specification addresses these limitations by enabling incremental response delivery while maintaining full backward compatibility.
## Technical Design

### Phase 1: Infrastructure
Phase 1 establishes the foundation for streaming by modifying schemas, APIs, and CLI tools.
#### Schema Changes

##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)

**Request Changes:**
```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```

- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)
**Response Changes:**

```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: Single response with `end_of_stream=true`
- For streaming requests: Multiple responses, all with `end_of_stream=false` except the final one
##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)

The prompt service wraps text completion, so it mirrors the same pattern:

**Request Changes:**

```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```
#### Gateway API Changes
The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.
**REST API Updates:**

- `POST /api/v1/text-completion`: Accepts a `streaming` parameter in the request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages
**Response Format (Streaming):**

Each streamed chunk follows the same schema structure:

```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```

Final chunk:

```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```
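A minimal client-side sketch of parsing such a stream delivered over SSE. This assumes each chunk arrives as a `data:` line carrying the JSON shown above; the exact wire framing is an assumption, not something this spec fixes:

```python
import json

def parse_sse_chunks(raw: str):
    """Yield parsed chunk dicts from an SSE payload; stop after end_of_stream."""
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        chunk = json.loads(line[len("data:"):].strip())
        yield chunk
        if chunk.get("end_of_stream"):
            break

# Example payload in the shape shown above
sse = (
    'data: {"response": "partial ", "end_of_stream": false, "model": "m"}\n'
    '\n'
    'data: {"response": "text", "end_of_stream": true, '
    '"in_token": 150, "out_token": 500, "model": "m"}\n'
)
chunks = list(parse_sse_chunks(sse))
text = "".join(c["response"] for c in chunks)
# text == "partial text"
```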
#### Python API Changes
The Python client API must support both streaming and non-streaming modes while maintaining backward compatibility.
**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):

```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns the complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```
**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):

Similar pattern, with a `streaming` parameter and an async generator variant.
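Hypothetical usage of the async generator API sketched above. `FakeLlmClient` is a stub standing in for `LlmClient`, purely to illustrate the calling convention; a real client would yield chunks as Pulsar messages arrive:

```python
import asyncio

class FakeLlmClient:
    """Stub with the same request_stream() shape as the proposed LlmClient."""

    async def request_stream(self, system, prompt, timeout=300):
        # Stand-in for tokens arriving from the service
        for token in ["The ", "answer ", "is 42."]:
            yield token

async def main():
    client = FakeLlmClient()
    parts = []
    async for chunk in client.request_stream("You are helpful.", "Question?"):
        parts.append(chunk)  # a CLI would write each chunk to stdout here
    return "".join(parts)

result = asyncio.run(main())
# result == "The answer is 42."
```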
#### CLI Tool Changes

**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):

```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```

- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for the complete response, then output

**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):

```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```

Same pattern as `tg-invoke-llm`.
#### LLM Service Base Class Changes

**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):

```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```
### Phase 2: VertexAI Proof of Concept
Phase 2 implements streaming in a single provider (VertexAI) to validate the infrastructure and enable end-to-end testing.
#### VertexAI Implementation

**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`

**Changes:**

- Override `supports_streaming()` to return `True`
- Implement the `generate_content_stream()` async generator
- Handle both Gemini and Claude models (via the VertexAI Anthropic API)
**Gemini Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
```
**Claude (via VertexAI Anthropic) Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()
```
#### Testing
- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work
### Phase 3: All LLM Providers
Phase 3 extends streaming support to all LLM providers in the system.
#### Provider Implementation Status

Each provider must either:

- **Full Streaming Support**: Implement `generate_content_stream()`
- **Compatibility Mode**: Handle the `end_of_stream` flag correctly (return a single response with `end_of_stream=true`)
| Provider | Package | Streaming Support |
|---|---|---|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |
#### Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```
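The delta-filtering logic above can be exercised without a live API by stubbing OpenAI-style chunk objects. All types below are hypothetical stand-ins for illustration only:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for OpenAI-style streaming chunk objects
@dataclass
class Delta:
    content: Optional[str]

@dataclass
class Choice:
    delta: Delta

@dataclass
class ChatChunk:
    choices: list

async def fake_stream():
    # Real streams often begin/end with role-only or finish-reason-only
    # chunks whose delta.content is None; these must be skipped
    for content in [None, "Hello", ", ", "world", None]:
        yield ChatChunk(choices=[Choice(delta=Delta(content=content))])

async def collect():
    parts = []
    async for chunk in fake_stream():
        if chunk.choices[0].delta.content:  # skip empty deltas
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

text = asyncio.run(collect())
# text == "Hello, world"
```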
### Phase 4: Agent API
Phase 4 extends streaming to the Agent API. This is more complex because the Agent API is already multi-message by nature (thought → action → observation → repeat → final answer).
#### Current Agent Schema

```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```
#### Proposed Agent Schema Changes

**Request Changes:**

```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```
**Response Changes:**
The agent produces multiple types of output during its reasoning cycle:
- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors
Since `message_type` identifies what kind of content is being sent, the separate `answer`, `error`, `thought`, and `observation` fields can be collapsed into a single `content` field:

```python
class AgentResponse(Record):
    message_type = String()     # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on message_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```
**Field Semantics:**

- `message_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message
- `content`: The actual streamed content, interpreted based on `message_type`
- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage
- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream
#### Agent Streaming Behavior

When `streaming=true`:

1. **Thought streaming:**
   - Multiple chunks with `message_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. **Action notification:**
   - Single chunk with `message_type="action"`, `end_of_message=true`
3. **Observation:**
   - Chunk(s) with `message_type="observation"`, final has `end_of_message=true`
4. Repeat steps 1-3 as the agent reasons
5. **Final answer:**
   - `message_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`
**Example Stream Sequence:**

```
{message_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{message_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{message_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{message_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{message_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{message_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{message_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```
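A sketch of client-side assembly for such a stream: chunks are grouped into complete messages using `end_of_message`, and iteration stops at `end_of_dialog`. Chunks are represented as plain dicts here for illustration:

```python
def assemble_agent_stream(chunks):
    """Group streamed chunks into (message_type, full_text) pairs."""
    messages = []
    buffer = []
    for chunk in chunks:
        buffer.append(chunk["content"])
        if chunk["end_of_message"]:
            # Current thought/action/observation/answer is complete
            messages.append((chunk["message_type"], "".join(buffer)))
            buffer = []
        if chunk["end_of_dialog"]:
            break  # final message in the stream
    return messages

stream = [
    {"message_type": "thought", "content": "I need to",
     "end_of_message": False, "end_of_dialog": False},
    {"message_type": "thought", "content": " search for...",
     "end_of_message": True, "end_of_dialog": False},
    {"message_type": "action", "content": "search",
     "end_of_message": True, "end_of_dialog": False},
    {"message_type": "answer", "content": "The answer is...",
     "end_of_message": True, "end_of_dialog": True},
]
messages = assemble_agent_stream(stream)
```

A UI would typically render each `(message_type, text)` pair into a different visual element (reasoning trace, tool call, final answer).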
When `streaming=false`:

- Current behavior preserved
- Single response with the complete answer
- `end_of_message=true`, `end_of_dialog=true`
#### Gateway and Python API

- **Gateway:** New SSE/WebSocket endpoint for agent streaming
- **Python API:** New `agent_stream()` async generator method
## Security Considerations
- No new attack surface: Streaming uses same authentication/authorization
- Rate limiting: Apply per-token or per-chunk rate limits if needed
- Connection handling: Properly terminate streams on client disconnect
- Timeout management: Streaming requests need appropriate timeout handling
## Performance Considerations
- Memory: Streaming reduces peak memory usage (no full response buffering)
- Latency: Time-to-first-token significantly reduced
- Connection overhead: SSE/WebSocket connections have keep-alive overhead
- Pulsar throughput: Multiple small messages vs. single large message tradeoff
## Testing Strategy

### Unit Tests
- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic
### Integration Tests
- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods
### End-to-End Tests
- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads
### Backward Compatibility Tests
- Existing clients work without modification
- Non-streaming requests behave identically
## Migration Plan

### Phase 1: Infrastructure
- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates
### Phase 2: VertexAI
- Deploy VertexAI streaming implementation
- Validate with test workloads
### Phase 3: All Providers
- Roll out provider updates incrementally
- Monitor for issues
### Phase 4: Agent API
- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation
## Timeline
| Phase | Description | Dependencies |
|---|---|---|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |
## Design Decisions
The following questions were resolved during specification:
1. **Token Counts in Streaming:** Token counts are deltas, not running totals. Consumers can sum them if needed. This matches how most providers report usage and simplifies the implementation.

2. **Error Handling in Streams:** If an error occurs, the `error` field is populated and no other fields are needed. An error is always the final communication; no subsequent messages are permitted or expected after an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams, `message_type="error"` with `end_of_dialog=true`.

3. **Partial Response Recovery:** The messaging protocol (Pulsar) is resilient, so message-level retry is not needed. If a client loses track of the stream or disconnects, it must retry the full request from scratch.

4. **Prompt Service Streaming:** Streaming is only supported for text (`text`) responses, not structured (`object`) responses. The prompt service knows at the outset whether the output will be JSON or text, based on the prompt template. If a streaming request is made for a JSON-output prompt, the service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error
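Since token counts are deltas, a consumer wanting totals simply sums them across chunks. A sketch, with chunks represented as dicts and `in_token`/`out_token` possibly absent on intermediate chunks:

```python
def total_usage(chunks):
    """Sum per-chunk token deltas into (in_total, out_total)."""
    in_total = sum(c.get("in_token") or 0 for c in chunks)
    out_total = sum(c.get("out_token") or 0 for c in chunks)
    return in_total, out_total

usage = total_usage([
    {"response": "partial ", "end_of_stream": False},
    {"response": "text", "end_of_stream": True,
     "in_token": 150, "out_token": 500},
])
# usage == (150, 500)
```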
## Open Questions
None at this time.
## References

- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`