mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 08:26:21 +02:00
570 lines
18 KiB
Markdown
# Streaming LLM Responses Technical Specification

## Overview

This specification describes the implementation of streaming support for LLM
responses in TrustGraph. Streaming enables real-time delivery of generated
tokens as they are produced by the LLM, rather than waiting for complete
response generation.

This implementation supports the following use cases:

1. **Real-time User Interfaces**: Stream tokens to the UI as they are generated,
   providing immediate visual feedback
2. **Reduced Time-to-First-Token**: Users see output begin immediately
   rather than waiting for full generation
3. **Long Response Handling**: Handle very long outputs that might otherwise
   time out or exceed memory limits
4. **Interactive Applications**: Enable responsive chat and agent interfaces

## Goals

- **Backward Compatibility**: Existing non-streaming clients continue to work
  without modification
- **Consistent API Design**: Streaming and non-streaming use the same schema
  patterns with minimal divergence
- **Provider Flexibility**: Support streaming where available, with graceful
  fallback where it is not
- **Phased Rollout**: Incremental implementation to reduce risk
- **End-to-End Support**: Streaming from the LLM provider through to client
  applications via Pulsar, the Gateway API, and the Python API

## Background

### Current Architecture

The current LLM text completion flow operates as follows:

1. The client sends a `TextCompletionRequest` with `system` and `prompt` fields
2. The LLM service processes the request and waits for complete generation
3. A single `TextCompletionResponse` is returned with the complete `response` string

Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```

### Current Limitations

- **Latency**: Users must wait for complete generation before seeing any output
- **Timeout Risk**: Long generations may exceed client timeout thresholds
- **Poor UX**: No feedback during generation creates a perception of slowness
- **Resource Usage**: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response
delivery while maintaining full backward compatibility.

## Technical Design

### Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs,
and CLI tools.

#### Schema Changes

##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)

**Request Changes:**

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```

- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)

**Response Changes:**

```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: a single response with `end_of_stream=true`
- For streaming requests: multiple responses, all with `end_of_stream=false`
  except the final one

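The consumer-side contract implied by these fields can be sketched as follows. This is a minimal illustration, not the TrustGraph client; `Chunk` is a stand-in for the `TextCompletionResponse` record, and token counts are treated as per-chunk deltas (see Design Decisions):

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Chunk:
    # Stand-in for TextCompletionResponse: partial text plus per-chunk token deltas.
    response: str
    end_of_stream: bool
    in_token: Optional[int] = None
    out_token: Optional[int] = None

def assemble(chunks: Iterable[Chunk]) -> tuple:
    """Accumulate partial responses until end_of_stream; sum token-count deltas."""
    text, in_tok, out_tok = [], 0, 0
    for c in chunks:
        text.append(c.response)
        in_tok += c.in_token or 0
        out_tok += c.out_token or 0
        if c.end_of_stream:
            break
    return "".join(text), in_tok, out_tok

result = assemble([Chunk("Hel", False), Chunk("lo", True, in_token=10, out_token=5)])
# result == ("Hello", 10, 5)
```

The same loop handles a non-streaming reply unchanged: it is simply a one-element stream whose only chunk has `end_of_stream=true`.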
##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)

The prompt service wraps text completion, so it mirrors the same pattern:

**Request Changes:**

```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

#### Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

**REST API Updates:**

- `POST /api/v1/text-completion`: Accept a `streaming` parameter in the request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages

**Response Format (Streaming):**

Each streamed chunk follows the same schema structure:

```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```

Final chunk:

```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```

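If SSE is chosen, a client would receive one JSON object per `data:` line and stop at `end_of_stream`. A sketch of that consumer side, assuming the framing proposed above (one chunk per `data:` line; exact framing is not yet final):

```python
import json
from typing import Iterable, Iterator

def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded chunk objects from SSE 'data:' lines, stopping at end_of_stream."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip comments, event names, keep-alive blanks
        chunk = json.loads(line[len("data:"):].strip())
        yield chunk
        if chunk.get("end_of_stream"):
            return

stream = [
    'data: {"response": "partial ", "end_of_stream": false, "model": "m"}',
    'data: {"response": "text", "end_of_stream": true, "in_token": 150, "out_token": 500, "model": "m"}',
]
text = "".join(c["response"] for c in parse_sse(stream))
# text == "partial text"
```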
#### Python API Changes

The Python client API must support both streaming and non-streaming modes
while maintaining backward compatibility.

**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):

```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns the complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```

**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):

Similar pattern, with a `streaming` parameter and an async generator variant.

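Calling code for `request_stream` would iterate with `async for`. A sketch using a stub with the same shape as the proposed method (assuming here that chunks are plain text; the real client may yield richer objects):

```python
import asyncio

# Stub standing in for LlmClient.request_stream: an async generator of text chunks.
async def request_stream(system, prompt):
    for piece in ["Hello", ", ", "world"]:
        yield piece

async def main():
    parts = []
    async for chunk in request_stream("You are helpful.", "Say hi"):
        parts.append(chunk)  # a CLI would print(chunk, end="", flush=True) instead
    return "".join(parts)

result = asyncio.run(main())
# result == "Hello, world"
```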
#### CLI Tool Changes

**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):

```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```

- Streaming is enabled by default for better interactive UX
- The `--no-streaming` flag disables streaming
- When streaming: output tokens to stdout as they arrive
- When not streaming: wait for the complete response, then output it

**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):

```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```

Same pattern as `tg-invoke-llm`.

#### LLM Service Base Class Changes

**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):

```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```

---

### Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the
infrastructure and enable end-to-end testing.

#### VertexAI Implementation

**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`

**Changes:**

1. Override `supports_streaming()` to return `True`
2. Implement the `generate_content_stream()` async generator
3. Handle both Gemini and Claude models (via the VertexAI Anthropic API)

**Gemini Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in the final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
```

**Claude (via VertexAI Anthropic) Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()
```

#### Testing

- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: non-streaming requests still work

---

### Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

#### Provider Implementation Status

Each provider must either:

1. **Full Streaming Support**: Implement `generate_content_stream()`
2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly
   (return a single response with `end_of_stream=true`)

| Provider | Package | Streaming Support |
|----------|---------|-------------------|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |

#### Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```
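Providers without native streaming can satisfy the contract in compatibility mode (option 2 above) by wrapping the blocking call as a one-chunk stream. A sketch, where `generate_content` stands in for the provider's existing non-streaming call:

```python
import asyncio

async def generate_content(system, prompt):
    # Stand-in for a provider's existing blocking completion call.
    return "complete response"

async def generate_content_stream(system, prompt):
    """Compatibility mode: emit the full response as a single final chunk."""
    response = await generate_content(system, prompt)
    yield response  # the caller marks this sole chunk end_of_stream=true

async def main():
    return [c async for c in generate_content_stream("sys", "hi")]

chunks = asyncio.run(main())
# chunks == ["complete response"]
```

This keeps a single code path on the consumer side: every provider presents a stream, and only the chunk count differs.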
---

### Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the
Agent API is already multi-message by nature (thought → action → observation
→ repeat → final answer).

#### Current Agent Schema

```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```

#### Proposed Agent Schema Changes

**Request Changes:**

```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

The agent produces multiple types of output during its reasoning cycle:

- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors

A new `chunk_type` field identifies what kind of content is being sent, so the
separate `answer`, `error`, `thought`, and `observation` fields can be collapsed
into a single `content` field:

```python
class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```

**Field Semantics:**

- `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message

- `content`: The actual streamed content, interpreted based on `chunk_type`

- `end_of_message`: When `true`, the current chunk type is complete
  - Example: all tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage

- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream

#### Agent Streaming Behavior

When `streaming=true`:

1. **Thought streaming**:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - The final thought chunk has `end_of_message=true`
2. **Action notification**:
   - A single chunk with `chunk_type="action"`, `end_of_message=true`
3. **Observation**:
   - Chunk(s) with `chunk_type="observation"`; the final one has `end_of_message=true`
4. **Repeat** steps 1-3 as the agent reasons
5. **Final answer**:
   - `chunk_type="answer"` with the final response in `content`
   - The last chunk has `end_of_message=true`, `end_of_dialog=true`

**Example Stream Sequence:**

```
{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```

When `streaming=false`:

- Current behavior is preserved
- A single response with the complete answer
- `end_of_message=true`, `end_of_dialog=true`

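A client consuming this stream dispatches on `chunk_type` and uses the two flags to segment output. A minimal illustration (not part of the spec) that groups chunks into complete messages, using dicts shaped like the example sequence:

```python
def consume(stream):
    """Group streamed agent chunks into (chunk_type, full_text) messages."""
    messages, buffer = [], []
    for c in stream:
        buffer.append(c["content"])
        if c["end_of_message"]:
            messages.append((c["chunk_type"], "".join(buffer)))
            buffer = []
        if c["end_of_dialog"]:
            break  # end_of_dialog always arrives on the final message
    return messages

stream = [
    {"chunk_type": "thought", "content": "I need to", "end_of_message": False, "end_of_dialog": False},
    {"chunk_type": "thought", "content": " search for...", "end_of_message": True, "end_of_dialog": False},
    {"chunk_type": "answer", "content": "The answer is...", "end_of_message": True, "end_of_dialog": True},
]
messages = consume(stream)
# messages == [("thought", "I need to search for..."), ("answer", "The answer is...")]
```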
#### Gateway and Python API

- Gateway: new SSE/WebSocket endpoint for agent streaming
- Python API: new `agent_stream()` async generator method

---

## Security Considerations

- **No new attack surface**: Streaming uses the same authentication/authorization
- **Rate limiting**: Apply per-token or per-chunk rate limits if needed
- **Connection handling**: Properly terminate streams on client disconnect
- **Timeout management**: Streaming requests need appropriate timeout handling

## Performance Considerations

- **Memory**: Streaming reduces peak memory usage (no full response buffering)
- **Latency**: Time-to-first-token is significantly reduced
- **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
- **Pulsar throughput**: Tradeoff between many small messages and a single
  large message

## Testing Strategy

### Unit Tests

- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic

### Integration Tests

- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods

### End-to-End Tests

- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads

### Backward Compatibility Tests

- Existing clients work without modification
- Non-streaming requests behave identically

## Migration Plan

### Phase 1: Infrastructure

- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates

### Phase 2: VertexAI

- Deploy the VertexAI streaming implementation
- Validate with test workloads

### Phase 3: All Providers

- Roll out provider updates incrementally
- Monitor for issues

### Phase 4: Agent API

- Deploy agent schema changes
- Deploy the agent streaming implementation
- Update documentation

## Timeline

| Phase | Description | Dependencies |
|-------|-------------|--------------|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |

## Design Decisions

The following questions were resolved during specification:

1. **Token Counts in Streaming**: Token counts are deltas, not running totals.
   Consumers can sum them if needed. This matches how most providers report
   usage and simplifies the implementation.

2. **Error Handling in Streams**: If an error occurs, the `error` field is
   populated and no other fields are needed. An error is always the final
   communication - no subsequent messages are permitted or expected after
   an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams,
   `chunk_type="error"` with `end_of_dialog=true`.

3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient,
   so message-level retry is not needed. If a client loses track of the stream
   or disconnects, it must retry the full request from scratch.

4. **Prompt Service Streaming**: Streaming is only supported for text (`text`)
   responses, not structured (`object`) responses. The prompt service knows at
   the outset whether the output will be JSON or text, based on the prompt
   template. If a streaming request is made for a JSON-output prompt, the
   service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error

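Decision 4 reduces to a small dispatch rule inside the prompt service. A sketch with a hypothetical helper (`delivery_mode` and its arguments are illustrative names; the real service would inspect the prompt template):

```python
def delivery_mode(template_output_is_json: bool, streaming_requested: bool) -> str:
    """Pick how the prompt service responds, per design decision 4."""
    if not streaming_requested:
        return "single-response"  # current behavior, end_of_stream=true
    if template_output_is_json:
        # Decision 4 permits either option; this sketch picks the first
        # (complete JSON in one response with end_of_stream=true).
        return "single-response"
    return "stream"  # incremental text chunks

mode = delivery_mode(template_output_is_json=True, streaming_requested=True)
# mode == "single-response"
```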
## Open Questions

None at this time.

## References

- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`