---
layout: default
title: "Streaming LLM Responses Technical Specification"
parent: "Tech Specs"
---

# Streaming LLM Responses Technical Specification

## Overview

This specification describes the implementation of streaming support for LLM
responses in TrustGraph. Streaming enables real-time delivery of generated
tokens as they are produced by the LLM, rather than waiting for complete
response generation.

This implementation supports the following use cases:

1. **Real-time User Interfaces**: Stream tokens to the UI as they are generated,
   providing immediate visual feedback
2. **Reduced Time-to-First-Token**: Users see output beginning immediately
   rather than waiting for full generation
3. **Long Response Handling**: Handle very long outputs that might otherwise
   time out or exceed memory limits
4. **Interactive Applications**: Enable responsive chat and agent interfaces

## Goals

- **Backward Compatibility**: Existing non-streaming clients continue to work
  without modification
- **Consistent API Design**: Streaming and non-streaming use the same schema
  patterns with minimal divergence
- **Provider Flexibility**: Support streaming where available, graceful
  fallback where not
- **Phased Rollout**: Incremental implementation to reduce risk
- **End-to-End Support**: Streaming from LLM provider through to client
  applications via Pulsar, Gateway API, and Python API

## Background

### Current Architecture

The current LLM text completion flow operates as follows:

1. Client sends `TextCompletionRequest` with `system` and `prompt` fields
2. LLM service processes the request and waits for complete generation
3. Single `TextCompletionResponse` returned with complete `response` string

Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```

### Current Limitations

- **Latency**: Users must wait for complete generation before seeing any output
- **Timeout Risk**: Long generations may exceed client timeout thresholds
- **Poor UX**: No feedback during generation creates perception of slowness
- **Resource Usage**: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response
delivery while maintaining full backward compatibility.

## Technical Design

### Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs,
and CLI tools.

#### Schema Changes

##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)

**Request Changes:**

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```

- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)

**Response Changes:**

```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: Single response with `end_of_stream=true`
- For streaming requests: Multiple responses, all with `end_of_stream=false`
  except the final one

##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)

The prompt service wraps text completion, so it mirrors the same pattern:

**Request Changes:**

```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

#### Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

**REST API Updates:**

- `POST /api/v1/text-completion`: Accept `streaming` parameter in request body
- Response behavior depends on streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages

**Response Format (Streaming):**

Each streamed chunk follows the same schema structure:

```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```

Final chunk:

```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```

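For illustration, a client could consume the SSE variant along these lines. This
is a minimal sketch, assuming each chunk arrives as an SSE `data:` line carrying
the JSON shown above; the gateway URL is a placeholder:

```python
import json
import requests

# Placeholder endpoint; substitute the real gateway address.
URL = "http://localhost:8088/api/v1/text-completion"

payload = {
    "system": "You are a helpful assistant.",
    "prompt": "Explain streaming in one paragraph.",
    "streaming": True,
}

# Assumption: each streamed chunk is delivered as an SSE "data: {...}" line.
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        chunk = json.loads(line[len(b"data:"):])
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("end_of_stream"):
            break
```
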
#### Python API Changes

The Python client API must support both streaming and non-streaming modes
while maintaining backward compatibility.

**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):

```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```

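For illustration, client code might drive these methods as follows. This is a
minimal sketch, assuming each streamed chunk exposes the `TextCompletionResponse`
fields (`response`, `out_token`) and that construction of the `LlmClient`
instance is handled elsewhere:

```python
import asyncio

# `llm` is an LlmClient instance; how it is obtained is out of scope here.

# Non-streaming (existing behavior): returns the complete response string.
text = llm.request(
    system="You are a helpful assistant.",
    prompt="Summarise the streaming design in one sentence.",
)
print(text)

# Streaming: consume chunks as they arrive.
async def stream_completion():
    out_tokens = 0
    async for chunk in llm.request_stream(
        system="You are a helpful assistant.",
        prompt="Explain the streaming design in detail.",
    ):
        print(chunk.response, end="", flush=True)
        # Token counts are deltas (see Design Decisions); sum them if needed.
        out_tokens += chunk.out_token or 0
    print(f"\n[{out_tokens} output tokens]")

asyncio.run(stream_completion())
```
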
**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):

Similar pattern with `streaming` parameter and async generator variant.

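A possible shape, sketched here with method names assumed to mirror `LlmClient`
(not the actual implementation):

```python
class PromptClient(BaseClient):

    def request(self, id, terms, timeout=300, streaming=False):
        """
        Non-streaming prompt invocation (backward compatible).
        Returns the rendered text or structured object.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, id, terms, timeout=300):
        """
        Streaming prompt invocation for text-output templates.
        Yields text chunks as they arrive.
        """
        # New async generator method
```
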
#### CLI Tool Changes

**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):

```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```

- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for complete response, then output

**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):

```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```

Same pattern as `tg-invoke-llm`.

#### LLM Service Base Class Changes

**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):

```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```

---

### Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the
infrastructure and enable end-to-end testing.

#### VertexAI Implementation

**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`

**Changes:**

1. Override `supports_streaming()` to return `True`
2. Implement `generate_content_stream()` async generator
3. Handle both Gemini and Claude models (via VertexAI Anthropic API)

**Gemini Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
```

**Claude (via VertexAI Anthropic) Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
        # Token counts from stream.get_final_message()
```

#### Testing

- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work

---

### Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

#### Provider Implementation Status

Each provider must either:

1. **Full Streaming Support**: Implement `generate_content_stream()`
2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly
   (return single response with `end_of_stream=true`)

| Provider | Package | Streaming Support |
|----------|---------|-------------------|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |

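For providers marked TBD, and for any provider without a native streaming API,
compatibility mode requires no streaming-specific code: leaving
`supports_streaming()` at its base-class default means streaming requests fall
back to a single complete response with `end_of_stream=true`. A minimal sketch
(class name and `generate_content()` signature assumed for illustration):

```python
class NonStreamingProvider(LlmService):

    # supports_streaming() is deliberately not overridden, so it returns False
    # and the base class answers streaming requests with one final response.

    async def generate_content(self, system, prompt, model, temperature):
        # Call the provider's blocking completion API and return the full text.
        ...
```
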
#### Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```

---

### Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the
Agent API is already multi-message by nature (thought → action → observation
→ repeat → final answer).

#### Current Agent Schema

```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```

#### Proposed Agent Schema Changes

**Request Changes:**

```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

The agent produces multiple types of output during its reasoning cycle:

- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors

Since `message_type` identifies what kind of content is being sent, the separate
`answer`, `error`, `thought`, and `observation` fields can be collapsed into
a single `content` field:

```python
class AgentResponse(Record):
    message_type = String()     # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on message_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```

**Field Semantics:**

- `message_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message

- `content`: The actual streamed content, interpreted based on `message_type`

- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage

- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream

#### Agent Streaming Behavior

When `streaming=true`:

1. **Thought streaming**:
   - Multiple chunks with `message_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. **Action notification**:
   - Single chunk with `message_type="action"`, `end_of_message=true`
3. **Observation**:
   - Chunk(s) with `message_type="observation"`, final has `end_of_message=true`
4. **Repeat** steps 1-3 as the agent reasons
5. **Final answer**:
   - `message_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`

**Example Stream Sequence:**

```
{message_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{message_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{message_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{message_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{message_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{message_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{message_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```

When `streaming=false`:

- Current behavior preserved
- Single response with complete answer
- `end_of_message=true`, `end_of_dialog=true`

#### Gateway and Python API

- Gateway: New SSE/WebSocket endpoint for agent streaming
- Python API: New `agent_stream()` async generator method

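For illustration, a chat front end might consume the agent stream as follows.
This is a minimal sketch, assuming `agent_stream()` yields objects carrying the
`AgentResponse` fields defined above; construction of the `api` client is out
of scope here:

```python
async def run_agent(api, question):
    async for msg in api.agent_stream(question=question):
        if msg.message_type == "error":
            # An error is always the final message in the stream
            # (see Design Decisions below).
            raise RuntimeError(msg.content)
        if msg.message_type == "thought":
            # Thoughts may arrive in several chunks; print them as they come.
            print(msg.content, end="", flush=True)
            if msg.end_of_message:
                print()
        elif msg.message_type == "action":
            print(f"[invoking tool: {msg.content}]")
        elif msg.message_type == "observation":
            print(f"[observation: {msg.content}]")
        elif msg.message_type == "answer":
            print(msg.content, end="", flush=True)
        if msg.end_of_dialog:
            break
```
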
---

## Security Considerations

- **No new attack surface**: Streaming uses the same authentication/authorization
- **Rate limiting**: Apply per-token or per-chunk rate limits if needed
- **Connection handling**: Properly terminate streams on client disconnect
- **Timeout management**: Streaming requests need appropriate timeout handling

## Performance Considerations

- **Memory**: Streaming reduces peak memory usage (no full response buffering)
- **Latency**: Time-to-first-token significantly reduced
- **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
- **Pulsar throughput**: Multiple small messages vs. single large message
  tradeoff

## Testing Strategy

### Unit Tests

- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic

### Integration Tests

- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods

### End-to-End Tests

- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads

### Backward Compatibility Tests

- Existing clients work without modification
- Non-streaming requests behave identically

## Migration Plan

### Phase 1: Infrastructure

- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates

### Phase 2: VertexAI

- Deploy VertexAI streaming implementation
- Validate with test workloads

### Phase 3: All Providers

- Roll out provider updates incrementally
- Monitor for issues

### Phase 4: Agent API

- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation

## Timeline

| Phase | Description | Dependencies |
|-------|-------------|--------------|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |

## Design Decisions

The following questions were resolved during specification:

1. **Token Counts in Streaming**: Token counts are deltas, not running totals.
   Consumers can sum them if needed. This matches how most providers report
   usage and simplifies the implementation.

2. **Error Handling in Streams**: If an error occurs, the `error` field is
   populated and no other fields are needed. An error is always the final
   communication - no subsequent messages are permitted or expected after
   an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams,
   `message_type="error"` with `end_of_dialog=true`.

3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient,
   so message-level retry is not needed. If a client loses track of the stream
   or disconnects, it must retry the full request from scratch.

4. **Prompt Service Streaming**: Streaming is only supported for text (`text`)
   responses, not structured (`object`) responses. The prompt service knows at
   the outset whether the output will be JSON or text based on the prompt
   template. If a streaming request is made for a JSON-output prompt, the
   service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error

## Open Questions

None at this time.

## References

- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`