# Streaming LLM Responses Technical Specification

## Overview

This specification describes the implementation of streaming support for LLM responses in TrustGraph. Streaming enables real-time delivery of generated tokens as they are produced by the LLM, rather than waiting for complete response generation.

This implementation supports the following use cases:

1. **Real-time User Interfaces**: Stream tokens to the UI as they are generated, providing immediate visual feedback
2. **Reduced Time-to-First-Token**: Users see output begin immediately rather than waiting for full generation
3. **Long Response Handling**: Handle very long outputs that might otherwise time out or exceed memory limits
4. **Interactive Applications**: Enable responsive chat and agent interfaces

## Goals

- **Backward Compatibility**: Existing non-streaming clients continue to work without modification
- **Consistent API Design**: Streaming and non-streaming use the same schema patterns with minimal divergence
- **Provider Flexibility**: Support streaming where available, with graceful fallback where not
- **Phased Rollout**: Incremental implementation to reduce risk
- **End-to-End Support**: Streaming from the LLM provider through to client applications via Pulsar, the Gateway API, and the Python API

## Background

### Current Architecture

The current LLM text completion flow operates as follows:

1. The client sends a `TextCompletionRequest` with `system` and `prompt` fields
2. The LLM service processes the request and waits for complete generation
3. A single `TextCompletionResponse` is returned with the complete `response` string

Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```
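For illustration, the current blocking call pattern looks roughly like the sketch below from a client's perspective. It assumes the `LlmClient` interface referenced under Python API Changes later in this document; the module path is inferred from the file path cited there, and the constructor arguments are placeholders rather than the actual signature.

```python
# Illustrative only: module path inferred from the file path cited in this spec.
from trustgraph.clients.llm_client import LlmClient

client = LlmClient(pulsar_host="pulsar://localhost:6650")  # constructor args are hypothetical

# The call blocks until the LLM has finished generating; nothing is
# visible to the caller until the complete response string arrives.
text = client.request(
    system="You are a helpful assistant.",
    prompt="Summarise the TrustGraph architecture.",
    timeout=300,
)
print(text)
```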
### Current Limitations

- **Latency**: Users must wait for complete generation before seeing any output
- **Timeout Risk**: Long generations may exceed client timeout thresholds
- **Poor UX**: No feedback during generation creates a perception of slowness
- **Resource Usage**: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response delivery while maintaining full backward compatibility.

## Technical Design

### Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs, and CLI tools.

#### Schema Changes

##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)

**Request Changes:**

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```

- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)

**Response Changes:**

```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: A single response with `end_of_stream=true`
- For streaming requests: Multiple responses, all with `end_of_stream=false` except the final one

##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)

The prompt service wraps text completion, so it mirrors the same pattern:

**Request Changes:**

```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

#### Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

**REST API Updates:**

- `POST /api/v1/text-completion`: Accept a `streaming` parameter in the request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages

**Response Format (Streaming):**

Each streamed chunk follows the same schema structure:

```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```

Final chunk:

```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```

#### Python API Changes

The Python client API must support both streaming and non-streaming modes while maintaining backward compatibility.

**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):

```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns the complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```

**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):

The same pattern applies, with a `streaming` parameter and an async generator variant.
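To make the intended usage concrete, a client consuming the streaming variant might look like the following sketch. Only the `request_stream` generator above is taken from this specification; the constructor arguments and the assumption that each yielded chunk exposes the `TextCompletionResponse` fields are illustrative.

```python
import asyncio

from trustgraph.clients.llm_client import LlmClient  # path as referenced above

async def main():
    client = LlmClient(pulsar_host="pulsar://localhost:6650")  # args are hypothetical

    # Chunks arrive incrementally; each is assumed to carry a partial
    # `response` plus `end_of_stream`, matching the schema above.
    async for chunk in client.request_stream(
        system="You are a helpful assistant.",
        prompt="Write a long explanation of graph RAG.",
        timeout=300,
    ):
        print(chunk.response, end="", flush=True)
        if chunk.end_of_stream:
            print(f"\n[in_token={chunk.in_token}, out_token={chunk.out_token}]")

asyncio.run(main())
```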
#### CLI Tool Changes

**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):

```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```

- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for complete response, then output

**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):

```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```

Same pattern as `tg-invoke-llm`.

#### LLM Service Base Class Changes

**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):

```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```

---

### Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the infrastructure and enable end-to-end testing.

#### VertexAI Implementation

**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`

**Changes:**

1. Override `supports_streaming()` to return `True`
2. Implement the `generate_content_stream()` async generator
3. Handle both Gemini and Claude models (via the VertexAI Anthropic API)

**Gemini Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
```

**Claude (via VertexAI Anthropic) Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()
```

#### Testing

- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work
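As an example of the first item above, a chunk-assembly unit test might be sketched as follows. The chunk representation and helper here are hypothetical, chosen only to exercise the `end_of_stream` semantics defined in Phase 1.

```python
def assemble(chunks):
    """Hypothetical helper: concatenate streamed response chunks in order."""
    return "".join(c["response"] for c in chunks)

def test_streamed_chunks_assemble_to_full_response():
    chunks = [
        {"response": "Streaming ", "end_of_stream": False},
        {"response": "works.", "end_of_stream": True, "in_token": 12, "out_token": 4},
    ]
    # Only the final message may carry end_of_stream=true.
    assert [c["end_of_stream"] for c in chunks] == [False, True]
    assert assemble(chunks) == "Streaming works."

def test_non_streaming_request_yields_single_final_response():
    # Backward compatibility: exactly one message, marked as the end of the stream.
    chunks = [{"response": "Streaming works.", "end_of_stream": True}]
    assert len(chunks) == 1 and chunks[0]["end_of_stream"] is True
```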
---

### Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

#### Provider Implementation Status

Each provider must either:

1. **Full Streaming Support**: Implement `generate_content_stream()`
2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly (return single response with `end_of_stream=true`)

| Provider | Package | Streaming Support |
|----------|---------|-------------------|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |

#### Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```
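For providers with their own native streaming SDKs, the shape is similar. Below is a sketch for the Claude/Anthropic provider; it assumes an async Anthropic client on `self.client` and a `max_output` configuration attribute, both of which are assumptions for illustration rather than existing code.

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    # Sketch only: assumes self.client is an anthropic.AsyncAnthropic instance.
    async with self.client.messages.stream(
        model=model,
        max_tokens=self.max_output,      # assumed configuration attribute
        temperature=temperature,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield LlmChunk(text=text)
        # Token counts are only known once generation has finished.
        final = await stream.get_final_message()
        yield LlmChunk(
            text="",
            in_token=final.usage.input_tokens,
            out_token=final.usage.output_tokens,
        )
```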
---

### Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the Agent API is already multi-message by nature (thought → action → observation → repeat → final answer).

#### Current Agent Schema

```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```

#### Proposed Agent Schema Changes

**Request Changes:**

```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

The agent produces multiple types of output during its reasoning cycle:

- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors

Since `chunk_type` identifies what kind of content is being sent, the separate `answer`, `error`, `thought`, and `observation` fields can be collapsed into a single `content` field:

```python
class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```

**Field Semantics:**

- `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message
- `content`: The actual streamed content, interpreted based on `chunk_type`
- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage
- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream

#### Agent Streaming Behavior

When `streaming=true`:

1. **Thought streaming**:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. **Action notification**:
   - Single chunk with `chunk_type="action"`, `end_of_message=true`
3. **Observation**:
   - Chunk(s) with `chunk_type="observation"`, final has `end_of_message=true`
4. **Repeat** steps 1-3 as the agent reasons
5. **Final answer**:
   - `chunk_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`

**Example Stream Sequence:**

```
{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```

When `streaming=false`:

- Current behavior preserved
- Single response with complete answer
- `end_of_message=true`, `end_of_dialog=true`

#### Gateway and Python API

- Gateway: New SSE/WebSocket endpoint for agent streaming
- Python API: New `agent_stream()` async generator method
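A sketch of how a client might consume the proposed agent stream, dispatching on `chunk_type`, is shown below. Only the `agent_stream()` generator name is taken from this section; the client class, module path, constructor arguments, and method parameters are hypothetical, and chunk attribute access mirrors the proposed `AgentResponse` fields.

```python
import asyncio

from trustgraph.clients.agent_client import AgentClient  # hypothetical module path

async def run_agent(question: str):
    client = AgentClient(pulsar_host="pulsar://localhost:6650")  # args are hypothetical

    async for chunk in client.agent_stream(question=question):
        if chunk.chunk_type in ("thought", "answer"):
            # Thoughts and the final answer arrive token by token.
            print(chunk.content, end="", flush=True)
            if chunk.end_of_message:
                print()
        elif chunk.chunk_type == "action":
            print(f"[invoking tool: {chunk.content}]")
        elif chunk.chunk_type == "observation":
            print(f"[observation: {chunk.content}]")
        elif chunk.chunk_type == "error":
            raise RuntimeError(chunk.content)

        if chunk.end_of_dialog:
            break

asyncio.run(run_agent("What is TrustGraph?"))
```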
---

## Security Considerations

- **No new attack surface**: Streaming uses the same authentication/authorization
- **Rate limiting**: Apply per-token or per-chunk rate limits if needed
- **Connection handling**: Properly terminate streams on client disconnect
- **Timeout management**: Streaming requests need appropriate timeout handling

## Performance Considerations

- **Memory**: Streaming reduces peak memory usage (no full response buffering)
- **Latency**: Time-to-first-token significantly reduced
- **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
- **Pulsar throughput**: Multiple small messages vs. single large message tradeoff

## Testing Strategy

### Unit Tests

- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic

### Integration Tests

- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods

### End-to-End Tests

- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads

### Backward Compatibility Tests

- Existing clients work without modification
- Non-streaming requests behave identically

## Migration Plan

### Phase 1: Infrastructure

- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates

### Phase 2: VertexAI

- Deploy VertexAI streaming implementation
- Validate with test workloads

### Phase 3: All Providers

- Roll out provider updates incrementally
- Monitor for issues

### Phase 4: Agent API

- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation

## Timeline

| Phase | Description | Dependencies |
|-------|-------------|--------------|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |

## Design Decisions

The following questions were resolved during specification:

1. **Token Counts in Streaming**: Token counts are deltas, not running totals. Consumers can sum them if needed. This matches how most providers report usage and simplifies the implementation.
2. **Error Handling in Streams**: If an error occurs, the `error` field is populated and no other fields are needed. An error is always the final communication - no subsequent messages are permitted or expected after an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams, `chunk_type="error"` with `end_of_dialog=true`.
3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient, so message-level retry is not needed. If a client loses track of the stream or disconnects, it must retry the full request from scratch.
4. **Prompt Service Streaming**: Streaming is only supported for text (`text`) responses, not structured (`object`) responses. The prompt service knows at the outset whether the output will be JSON or text based on the prompt template. If a streaming request is made for a JSON-output prompt, the service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error

## Open Questions

None at this time.

## References

- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`