# Streaming LLM Responses Technical Specification
## Overview
This specification describes the implementation of streaming support for LLM
responses in TrustGraph. Streaming enables real-time delivery of generated
tokens as they are produced by the LLM, rather than waiting for complete
response generation.
This implementation supports the following use cases:
1. **Real-time User Interfaces**: Stream tokens to UI as they are generated,
providing immediate visual feedback
2. **Reduced Time-to-First-Token**: Users see output beginning immediately
rather than waiting for full generation
3. **Long Response Handling**: Handle very long outputs that might otherwise
timeout or exceed memory limits
4. **Interactive Applications**: Enable responsive chat and agent interfaces
## Goals
- **Backward Compatibility**: Existing non-streaming clients continue to work
without modification
- **Consistent API Design**: Streaming and non-streaming use the same schema
patterns with minimal divergence
- **Provider Flexibility**: Support streaming where available, graceful
fallback where not
- **Phased Rollout**: Incremental implementation to reduce risk
- **End-to-End Support**: Streaming from LLM provider through to client
applications via Pulsar, Gateway API, and Python API
## Background
### Current Architecture
The current LLM text completion flow operates as follows:
1. Client sends `TextCompletionRequest` with `system` and `prompt` fields
2. LLM service processes the request and waits for complete generation
3. Single `TextCompletionResponse` returned with complete `response` string
Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):
```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```
### Current Limitations
- **Latency**: Users must wait for complete generation before seeing any output
- **Timeout Risk**: Long generations may exceed client timeout thresholds
- **Poor UX**: No feedback during generation creates perception of slowness
- **Resource Usage**: Full responses must be buffered in memory
This specification addresses these limitations by enabling incremental response
delivery while maintaining full backward compatibility.
## Technical Design
### Phase 1: Infrastructure
Phase 1 establishes the foundation for streaming by modifying schemas, APIs,
and CLI tools.
#### Schema Changes
##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)
**Request Changes:**
```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```
- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)
**Response Changes:**
```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```
- `end_of_stream`: When `true`, indicates this is the final (or only) response
  - For non-streaming requests: Single response with `end_of_stream=true`
  - For streaming requests: Multiple responses, all with `end_of_stream=false`
    except the final one
##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)
The prompt service wraps text completion, so it mirrors the same pattern:
**Request Changes:**
```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```
**Response Changes:**
```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```
#### Gateway API Changes
The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.
**REST API Updates:**
- `POST /api/v1/text-completion`: Accept `streaming` parameter in request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages
**Response Format (Streaming):**
Each streamed chunk follows the same schema structure:
```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```
Final chunk:
```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```
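Clients may consume the streaming variant however they like; the following is a minimal consumer sketch, assuming standard `data: <json>` SSE framing and a locally reachable Gateway (the URL, port, and error handling are illustrative, not part of this spec):
```python
import asyncio
import json

import aiohttp

async def stream_completion(base_url, system, prompt):
    """Read streamed chunks from the Gateway until end_of_stream is seen."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/api/v1/text-completion",
            json={"system": system, "prompt": prompt, "streaming": True},
        ) as resp:
            async for raw in resp.content:           # one SSE line at a time
                line = raw.decode().strip()
                if not line.startswith("data:"):     # skip comments / keep-alives
                    continue
                chunk = json.loads(line[len("data:"):])
                print(chunk.get("response", ""), end="", flush=True)
                if chunk.get("end_of_stream"):
                    break

# asyncio.run(stream_completion("http://localhost:8088", "You are helpful.", "Hi"))
```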
#### Python API Changes
The Python client API must support both streaming and non-streaming modes
while maintaining backward compatibility.
**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):
```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```
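Illustrative use of the two modes, assuming an `llm` client constructed in the usual way and that each streamed chunk is the text delta (both assumptions, not settled API):
```python
# Non-streaming (unchanged): returns the complete response string.
text = llm.request("You are a helpful assistant.", "Summarise this document.")

# Streaming: iterate chunks as they arrive.
async def show_stream(llm):
    async for chunk in llm.request_stream(
        "You are a helpful assistant.", "Summarise this document."
    ):
        print(chunk, end="", flush=True)   # emit each text delta immediately
    print()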
**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):
Similar pattern with `streaming` parameter and async generator variant.
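A sketch of that shape, mirroring the `LlmClient` stub above (method bodies and names beyond `request_stream` are illustrative):
```python
class PromptClient(BaseClient):

    def request(self, id, terms, timeout=300, streaming=False):
        """Non-streaming: render the template and return text or object."""
        # Existing behavior when streaming=False

    async def request_stream(self, id, terms, timeout=300):
        """Streaming: yield text chunks; only valid for text-output templates."""
        # New async generator method
```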
#### CLI Tool Changes
**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):
```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```
- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for complete response, then output
**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):
```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```
Same pattern as `tg-invoke-llm`.
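Both tools can share the same output loop; a minimal sketch, assuming an `LlmClient`-style object obtained in the usual way (`tg-invoke-prompt` would do the same with a template id and terms):
```python
import asyncio

async def emit(client, system, prompt, streaming=True):
    if streaming:
        # Print each chunk as it arrives; flush so tokens appear immediately.
        async for chunk in client.request_stream(system, prompt):
            print(chunk, end="", flush=True)
        print()
    else:
        # Wait for the complete response, then print it once.
        print(client.request(system, prompt))

# asyncio.run(emit(client, system_text, prompt_text, streaming=not args.no_streaming))
```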
#### LLM Service Base Class Changes
**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):
```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)
        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```
---
### Phase 2: VertexAI Proof of Concept
Phase 2 implements streaming in a single provider (VertexAI) to validate the
infrastructure and enable end-to-end testing.
#### VertexAI Implementation
**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
**Changes:**
1. Override `supports_streaming()` to return `True`
2. Implement `generate_content_stream()` async generator
3. Handle both Gemini and Claude models (via VertexAI Anthropic API)
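The streaming sketches below (and in Phase 3) yield an `LlmChunk` value, which is not defined in the current schema. A minimal assumed shape, sufficient for the base class to build `TextCompletionResponse` messages from:
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LlmChunk:
    """One increment of streamed output; token counts arrive only on the final chunk."""
    text: str
    in_token: Optional[int] = None
    out_token: Optional[int] = None
```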
**Gemini Streaming:**
```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,   # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
```
**Claude (via VertexAI Anthropic) Streaming:**
```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
        # Token counts from stream.get_final_message()
```
#### Testing
- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work
---
### Phase 3: All LLM Providers
Phase 3 extends streaming support to all LLM providers in the system.
#### Provider Implementation Status
Each provider must either:
1. **Full Streaming Support**: Implement `generate_content_stream()`
2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly
(return a single response with `end_of_stream=true`), as sketched below
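Compatibility mode requires no streaming code in the provider itself; a minimal sketch, assuming the `LlmService` base class behaviour from Phase 1 (the provider class name is hypothetical):
```python
class SomeProviderLlm(LlmService):
    """Hypothetical provider with no native streaming support."""

    # supports_streaming() is not overridden, so it returns False and the
    # base class answers streaming requests with a single response carrying
    # end_of_stream=True.

    async def generate_content(self, system, prompt, model, temperature):
        # Existing blocking generation path, unchanged.
        ...
```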
| Provider | Package | Streaming Support |
|----------|---------|-------------------|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |
#### Implementation Pattern
For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):
```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```
---
### Phase 4: Agent API
Phase 4 extends streaming to the Agent API. This is more complex because the
Agent API is already multi-message by nature (thought → action → observation
→ repeat → final answer).
#### Current Agent Schema
```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```
#### Proposed Agent Schema Changes
**Request Changes:**
```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```
**Response Changes:**
The agent produces multiple types of output during its reasoning cycle:
- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors
Since `chunk_type` identifies what kind of content is being sent, the separate
`answer`, `error`, `thought`, and `observation` fields can be collapsed into
a single `content` field:
```python
class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```
**Field Semantics:**
- `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message
- `content`: The actual streamed content, interpreted based on `chunk_type`
- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage
- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream
#### Agent Streaming Behavior
When `streaming=true`:
1. **Thought streaming**:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. **Action notification**:
   - Single chunk with `chunk_type="action"`, `end_of_message=true`
3. **Observation**:
   - Chunk(s) with `chunk_type="observation"`, final has `end_of_message=true`
4. **Repeat** steps 1-3 as the agent reasons
5. **Final answer**:
   - `chunk_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`
**Example Stream Sequence:**
```
{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```
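One way a client might reassemble such a stream, buffering chunks of each type until `end_of_message` (chunks are shown as plain dicts for brevity; the delivered form depends on the Gateway/Python API):
```python
def assemble(stream):
    """Group streamed agent chunks into complete (chunk_type, text) messages."""
    buffer = []
    for chunk in stream:
        buffer.append(chunk["content"])
        if chunk["end_of_message"]:
            yield chunk["chunk_type"], "".join(buffer)
            buffer = []
        if chunk["end_of_dialog"]:
            break

# Applied to the example sequence above, this yields:
#   ("thought", "I need to search for..."), ("action", "search"),
#   ("observation", "Found: ..."), ("thought", "Based on this I can answer..."),
#   ("answer", "The answer is...")
```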
When `streaming=false`:
- Current behavior preserved
- Single response with complete answer
- `end_of_message=true`, `end_of_dialog=true`
#### Gateway and Python API
- Gateway: New SSE/WebSocket endpoint for agent streaming
- Python API: New `agent_stream()` async generator method, sketched below
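Illustrative use of the proposed `agent_stream()` method, assuming chunks are delivered as objects carrying the fields defined above (the exact signature is not fixed by this phase):
```python
async def ask(agent, question):
    """Print only the final answer from a streamed agent dialog."""
    async for chunk in agent.agent_stream(question):
        if chunk.chunk_type == "answer":
            print(chunk.content, end="", flush=True)
        elif chunk.chunk_type == "error":
            raise RuntimeError(chunk.content)
        if chunk.end_of_dialog:
            break
```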
---
## Security Considerations
- **No new attack surface**: Streaming uses same authentication/authorization
- **Rate limiting**: Apply per-token or per-chunk rate limits if needed
- **Connection handling**: Properly terminate streams on client disconnect
- **Timeout management**: Streaming requests need appropriate timeout handling
## Performance Considerations
- **Memory**: Streaming reduces peak memory usage (no full response buffering)
- **Latency**: Time-to-first-token significantly reduced
- **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
- **Pulsar throughput**: Multiple small messages vs. single large message
tradeoff
## Testing Strategy
### Unit Tests
- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic
### Integration Tests
- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods
### End-to-End Tests
- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads
### Backward Compatibility Tests
- Existing clients work without modification
- Non-streaming requests behave identically
## Migration Plan
### Phase 1: Infrastructure
- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates
### Phase 2: VertexAI
- Deploy VertexAI streaming implementation
- Validate with test workloads
### Phase 3: All Providers
- Roll out provider updates incrementally
- Monitor for issues
### Phase 4: Agent API
- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation
## Timeline
| Phase | Description | Dependencies |
|-------|-------------|--------------|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |
## Design Decisions
The following questions were resolved during specification:
1. **Token Counts in Streaming**: Token counts are deltas, not running totals.
Consumers can sum them if needed. This matches how most providers report
usage and simplifies the implementation.
2. **Error Handling in Streams**: If an error occurs, the `error` field is
populated and no other fields are needed. An error is always the final
communication - no subsequent messages are permitted or expected after
an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams,
`chunk_type="error"` with `end_of_dialog=true`.
3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient,
so message-level retry is not needed. If a client loses track of the stream
or disconnects, it must retry the full request from scratch.
4. **Prompt Service Streaming**: Streaming is only supported for text (`text`)
responses, not structured (`object`) responses. The prompt service knows at
the outset whether the output will be JSON or text based on the prompt
template. If a streaming request is made for a JSON-output prompt, the
service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error
## Open Questions
None at this time.
## References
- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`