Streaming LLM Responses Technical Specification
Overview
This specification describes the implementation of streaming support for LLM responses in TrustGraph. Streaming enables real-time delivery of generated tokens as they are produced by the LLM, rather than waiting for complete response generation.
This implementation supports the following use cases:
- Real-time User Interfaces: Stream tokens to UI as they are generated, providing immediate visual feedback
- Reduced Time-to-First-Token: Users see output beginning immediately rather than waiting for full generation
- Long Response Handling: Handle very long outputs that might otherwise timeout or exceed memory limits
- Interactive Applications: Enable responsive chat and agent interfaces
Goals
- Backward Compatibility: Existing non-streaming clients continue to work without modification
- Consistent API Design: Streaming and non-streaming use the same schema patterns with minimal divergence
- Provider Flexibility: Support streaming where available, graceful fallback where not
- Phased Rollout: Incremental implementation to reduce risk
- End-to-End Support: Streaming from LLM provider through to client applications via Pulsar, Gateway API, and Python API
Background
Current Architecture
The current LLM text completion flow operates as follows:
- Client sends `TextCompletionRequest` with `system` and `prompt` fields
- LLM service processes the request and waits for complete generation
- Single `TextCompletionResponse` returned with the complete `response` string
Current schema (trustgraph-base/trustgraph/schema/services/llm.py):
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
Current Limitations
- Latency: Users must wait for complete generation before seeing any output
- Timeout Risk: Long generations may exceed client timeout thresholds
- Poor UX: No feedback during generation creates perception of slowness
- Resource Usage: Full responses must be buffered in memory
This specification addresses these limitations by enabling incremental response delivery while maintaining full backward compatibility.
Technical Design
Phase 1: Infrastructure
Phase 1 establishes the foundation for streaming by modifying schemas, APIs, and CLI tools.
Schema Changes
LLM Schema (trustgraph-base/trustgraph/schema/services/llm.py)
Request Changes:
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)
Response Changes:
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: Single response with `end_of_stream=true`
- For streaming requests: Multiple responses, all with `end_of_stream=false` except the final one (see the assembly sketch below)
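For illustration, a consumer might reassemble a streamed completion as follows. This is a minimal sketch only; `receive_response()` is a hypothetical stand-in for however the client obtains successive responses for a single request.

async def collect_completion(receive_response):
    # Sketch: accumulate streamed TextCompletionResponse fragments until
    # end_of_stream is seen.
    parts = []
    while True:
        resp = await receive_response()
        if resp.error:
            raise RuntimeError(resp.error)
        if resp.response:
            parts.append(resp.response)
        if resp.end_of_stream:
            break
    return "".join(parts)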
Prompt Schema (trustgraph-base/trustgraph/schema/services/prompt.py)
The prompt service wraps text completion, so it mirrors the same pattern:
Request Changes:
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
Response Changes:
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
Gateway API Changes
The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.
REST API Updates:
- `POST /api/v1/text-completion`: Accepts `streaming` parameter in the request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages
Response Format (Streaming):
Each streamed chunk follows the same schema structure:
{
    "response": "partial text...",
    "end_of_stream": false,
    "model": "model-name"
}

Final chunk:

{
    "response": "final text chunk",
    "end_of_stream": true,
    "in_token": 150,
    "out_token": 500,
    "model": "model-name"
}
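How a client consumes the SSE form is outside the scope of the schema, but for illustration only (the endpoint path follows the REST update above; the event framing and the use of aiohttp are assumptions, not part of this spec):

import json
import aiohttp

async def stream_completion(base_url, system, prompt):
    # Sketch: request streaming delivery and read Server-Sent Events line by line.
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/api/v1/text-completion",
            json={"system": system, "prompt": prompt, "streaming": True},
        ) as resp:
            async for raw in resp.content:           # one SSE line at a time
                line = raw.decode().strip()
                if not line.startswith("data:"):
                    continue                         # skip blank lines / comments
                chunk = json.loads(line[len("data:"):])
                yield chunk
                if chunk.get("end_of_stream"):
                    break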
Python API Changes
The Python client API must support both streaming and non-streaming modes while maintaining backward compatibility.
LlmClient Updates (trustgraph-base/trustgraph/clients/llm_client.py):
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
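Illustrative usage of the streaming variant (a sketch; whether chunks are plain strings or richer objects is an implementation detail left open above):

async def demo(client, system, prompt):
    # Sketch: write tokens to the terminal as they arrive.
    async for chunk in client.request_stream(system, prompt):
        print(chunk, end="", flush=True)
    print()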
PromptClient Updates (trustgraph-base/trustgraph/base/prompt_client.py):
Similar pattern with streaming parameter and async generator variant.
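A minimal sketch of what that could look like; the method names simply mirror the `LlmClient` pattern and are illustrative, not final:

class PromptClient(BaseClient):

    def request(self, id, terms, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns the rendered text or object.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, id, terms, timeout=300):
        """
        Streaming request.
        Yields text chunks as they arrive (text templates only).
        """
        # New async generator method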
CLI Tool Changes
tg-invoke-llm (trustgraph-cli/trustgraph/cli/invoke_llm.py):
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for the complete response, then output
tg-invoke-prompt (trustgraph-cli/trustgraph/cli/invoke_prompt.py):
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
Same pattern as tg-invoke-llm.
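The interactive output loop could be as simple as the following sketch, assuming the Python API streaming method described above:

import sys

async def run_streaming(client, system, prompt):
    # Sketch: flush each chunk to stdout immediately so the user sees
    # tokens as they are generated.
    async for chunk in client.request_stream(system, prompt):
        sys.stdout.write(chunk)
        sys.stdout.flush()
    sys.stdout.write("\n")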
LLM Service Base Class Changes
LlmService (trustgraph-base/trustgraph/base/llm_service.py):
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
Phase 2: VertexAI Proof of Concept
Phase 2 implements streaming in a single provider (VertexAI) to validate the infrastructure and enable end-to-end testing.
VertexAI Implementation
Module: trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py
Changes:
- Override `supports_streaming()` to return `True`
- Implement `generate_content_stream()` as an async generator
- Handle both Gemini and Claude models (via the VertexAI Anthropic API)
Gemini Streaming:
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
Claude (via VertexAI Anthropic) Streaming:
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()
Testing
- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work
Phase 3: All LLM Providers
Phase 3 extends streaming support to all LLM providers in the system.
Provider Implementation Status
Each provider must either:
- Full Streaming Support: Implement `generate_content_stream()`
- Compatibility Mode: Handle the `end_of_stream` flag correctly (return a single response with `end_of_stream=true`)
| Provider | Package | Streaming Support |
|---|---|---|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |
Implementation Pattern
For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
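Providers without native streaming can simply stay in compatibility mode: leave `supports_streaming()` returning `False` and the base class sends a single response with `end_of_stream=true`. If a uniform streaming interface is preferred, a trivial single-chunk fallback is also possible; the sketch below assumes the base-class names used earlier, and the response field names are illustrative:

async def generate_content_stream(self, system, prompt, model, temperature):
    # Fallback sketch: run the normal blocking generation, then yield the
    # whole result as one chunk; the base class marks it end_of_stream=true.
    response = await self.generate_content(system, prompt, model, temperature)
    yield LlmChunk(
        text=response.text,            # field names illustrative
        in_token=response.in_token,
        out_token=response.out_token,
    )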
Phase 4: Agent API
Phase 4 extends streaming to the Agent API. This is more complex because the Agent API is already multi-message by nature (thought → action → observation → repeat → final answer).
Current Agent Schema
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
Proposed Agent Schema Changes
Request Changes:
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
Response Changes:
The agent produces multiple types of output during its reasoning cycle:
- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors
Since `chunk_type` identifies what kind of content is being sent, the separate `answer`, `error`, `thought`, and `observation` fields can be collapsed into a single `content` field:
class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
Field Semantics:
- `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message
- `content`: The actual streamed content, interpreted based on `chunk_type`
- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move on to the next stage
- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream
Agent Streaming Behavior
When `streaming=true`:
1. Thought streaming:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. Action notification:
   - Single chunk with `chunk_type="action"`, `end_of_message=true`
3. Observation:
   - Chunk(s) with `chunk_type="observation"`; the final one has `end_of_message=true`
4. Repeat steps 1-3 as the agent reasons
5. Final answer:
   - `chunk_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`
Example Stream Sequence:
{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
When `streaming=false`:
- Current behavior preserved
- Single response with the complete answer
- `end_of_message=true`, `end_of_dialog=true`
Gateway and Python API
- Gateway: New SSE/WebSocket endpoint for agent streaming
- Python API: New `agent_stream()` async generator method (sketched below)
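Illustrative consumption via the new Python method; the dispatch on `chunk_type` follows the schema above, while the method signature and rendering are hypothetical:

async def run_agent(client, question):
    # Sketch: render each chunk according to its type as it arrives.
    async for chunk in client.agent_stream(question=question):
        if chunk.chunk_type == "thought":
            print(chunk.content, end="", flush=True)
        elif chunk.chunk_type == "action":
            print(f"\n[action] {chunk.content}")
        elif chunk.chunk_type == "observation":
            print(f"[observation] {chunk.content}")
        elif chunk.chunk_type == "answer":
            print(f"\n{chunk.content}")
        elif chunk.chunk_type == "error":
            raise RuntimeError(chunk.content)
        if chunk.end_of_dialog:
            break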
Security Considerations
- No new attack surface: Streaming uses same authentication/authorization
- Rate limiting: Apply per-token or per-chunk rate limits if needed
- Connection handling: Properly terminate streams on client disconnect
- Timeout management: Streaming requests need appropriate timeout handling
Performance Considerations
- Memory: Streaming reduces peak memory usage (no full response buffering)
- Latency: Time-to-first-token significantly reduced
- Connection overhead: SSE/WebSocket connections have keep-alive overhead
- Pulsar throughput: Multiple small messages vs. single large message tradeoff
Testing Strategy
Unit Tests
- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic (see the example sketch below)
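For example, a chunk-assembly unit test might look like the following sketch; the helper under test, `assemble_chunks`, is hypothetical:

def test_chunks_assemble_in_order():
    # Sketch: three streamed responses reassemble into the full text, and
    # only the final message carries end_of_stream=true.
    chunks = [
        {"response": "Hello", "end_of_stream": False},
        {"response": ", world", "end_of_stream": False},
        {"response": "!", "end_of_stream": True},
    ]
    assert assemble_chunks(chunks) == "Hello, world!"
    assert all(not c["end_of_stream"] for c in chunks[:-1])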
Integration Tests
- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods
End-to-End Tests
- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads
Backward Compatibility Tests
- Existing clients work without modification
- Non-streaming requests behave identically
Migration Plan
Phase 1: Infrastructure
- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates
Phase 2: VertexAI
- Deploy VertexAI streaming implementation
- Validate with test workloads
Phase 3: All Providers
- Roll out provider updates incrementally
- Monitor for issues
Phase 4: Agent API
- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation
Timeline
| Phase | Description | Dependencies |
|---|---|---|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |
Design Decisions
The following questions were resolved during specification:
- Token Counts in Streaming: Token counts are deltas, not running totals; consumers can sum them if needed (see the sketch after this list). This matches how most providers report usage and simplifies the implementation.
- Error Handling in Streams: If an error occurs, the `error` field is populated and no other fields are needed. An error is always the final communication; no subsequent messages are permitted or expected after an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams, `chunk_type="error"` with `end_of_dialog=true`.
- Partial Response Recovery: The messaging protocol (Pulsar) is resilient, so message-level retry is not needed. If a client loses track of the stream or disconnects, it must retry the full request from scratch.
- Prompt Service Streaming: Streaming is only supported for text (`text`) responses, not structured (`object`) responses. The prompt service knows at the outset whether the output will be JSON or text, based on the prompt template. If a streaming request is made for a JSON-output prompt, the service should either:
  - Return the complete JSON in a single response with `end_of_stream=true`, or
  - Reject the streaming request with an error
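Summing the per-chunk token deltas into totals is straightforward; a sketch:

def total_usage(responses):
    # Sketch: responses is the list of chunks received for one request.
    # Token counts may be absent (None) on intermediate chunks.
    total_in = sum(r.in_token or 0 for r in responses)
    total_out = sum(r.out_token or 0 for r in responses)
    return total_in, total_out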
Open Questions
None at this time.
References
- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`