Feature/streaming llm phase 1 (#566)

* Tidy up duplicate tech specs in the docs directory
* Streaming LLM text-completion service tech spec
* text-completion and prompt interfaces
* Streaming change applied to all LLMs; so far tested with VertexAI
* Skip Pinecone unit tests; an upstream module issue was affecting things, and tests are passing again
* Added agent streaming; not working and has broken tests

This commit is contained in: parent 943a9d83b0, commit 310a2deb06. 44 changed files with 2684 additions and 937 deletions.

docs/tech-specs/streaming-llm-responses.md (new file, 570 lines)

# Streaming LLM Responses Technical Specification

## Overview

This specification describes the implementation of streaming support for LLM
responses in TrustGraph. Streaming enables real-time delivery of generated
tokens as they are produced by the LLM, rather than waiting for complete
response generation.

This implementation supports the following use cases:

1. **Real-time User Interfaces**: Stream tokens to the UI as they are generated,
   providing immediate visual feedback
2. **Reduced Time-to-First-Token**: Users see output beginning immediately
   rather than waiting for full generation
3. **Long Response Handling**: Handle very long outputs that might otherwise
   time out or exceed memory limits
4. **Interactive Applications**: Enable responsive chat and agent interfaces

## Goals

- **Backward Compatibility**: Existing non-streaming clients continue to work
  without modification
- **Consistent API Design**: Streaming and non-streaming use the same schema
  patterns with minimal divergence
- **Provider Flexibility**: Support streaming where available, graceful
  fallback where not
- **Phased Rollout**: Incremental implementation to reduce risk
- **End-to-End Support**: Streaming from LLM provider through to client
  applications via Pulsar, Gateway API, and Python API

## Background

### Current Architecture

The current LLM text completion flow operates as follows:

1. Client sends `TextCompletionRequest` with `system` and `prompt` fields
2. LLM service processes the request and waits for complete generation
3. Single `TextCompletionResponse` returned with complete `response` string

Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```

### Current Limitations

- **Latency**: Users must wait for complete generation before seeing any output
- **Timeout Risk**: Long generations may exceed client timeout thresholds
- **Poor UX**: No feedback during generation creates a perception of slowness
- **Resource Usage**: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response
delivery while maintaining full backward compatibility.

## Technical Design

### Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs,
and CLI tools.

#### Schema Changes

##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)

**Request Changes:**

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```

- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)

**Response Changes:**

```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: Single response with `end_of_stream=true`
- For streaming requests: Multiple responses, all with `end_of_stream=false`
  except the final one

##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)

The prompt service wraps text completion, so it mirrors the same pattern:

**Request Changes:**

```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

#### Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

**REST API Updates:**

- `POST /api/v1/text-completion`: Accept `streaming` parameter in request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages

**Response Format (Streaming):**

Each streamed chunk follows the same schema structure:

```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```

Final chunk:

```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```

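For illustration, a client might consume the SSE variant as follows. This is a
minimal sketch, not part of the API contract: it assumes chunks are delivered
as SSE `data:` lines containing the JSON objects above (the exact framing is
not fixed by this spec), and it uses `httpx` as the HTTP client.

```python
import json
import httpx

async def stream_completion(base_url, system, prompt):
    # Yields response text fragments as they arrive over SSE
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            f"{base_url}/api/v1/text-completion",
            json={"system": system, "prompt": prompt, "streaming": True},
        ) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data:"):
                    continue  # Skip SSE comments and keep-alives
                chunk = json.loads(line[len("data:"):])
                yield chunk["response"]
                if chunk.get("end_of_stream"):
                    break
```

A WebSocket client would follow the same chunk semantics over a different
transport.
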
#### Python API Changes

The Python client API must support both streaming and non-streaming modes
while maintaining backward compatibility.

**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):

```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```

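A hypothetical usage sketch of the proposed streaming method, assuming
`request_stream()` yields plain text chunks:

```python
async def main(llm):
    # llm: an LlmClient instance (construction details omitted)
    async for chunk in llm.request_stream(
        system="You are a helpful assistant.",
        prompt="Explain streaming LLM responses in one paragraph.",
    ):
        print(chunk, end="", flush=True)
    print()

# Run with: asyncio.run(main(llm))
```
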
**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):

A similar pattern applies, with a `streaming` parameter and an async generator
variant.

#### CLI Tool Changes

**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):

```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```

- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for the complete response, then output

**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):

```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```

Same pattern as `tg-invoke-llm`. A sketch of the streaming output loop follows.

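A minimal sketch of that output loop, assuming the `request_stream()` and
`request()` client API proposed above (argument parsing and client
construction omitted):

```python
import sys

async def run(llm, system, prompt, streaming=True):
    if streaming:
        # Write tokens to stdout as they arrive
        async for chunk in llm.request_stream(system, prompt):
            sys.stdout.write(chunk)
            sys.stdout.flush()
        sys.stdout.write("\n")
    else:
        # Wait for the complete response, then output it
        print(llm.request(system, prompt))
```
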
#### LLM Service Base Class Changes

**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):

```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            # Final response carries token counts and end_of_stream=true
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```

---

### Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the
infrastructure and enable end-to-end testing.

#### VertexAI Implementation

**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`

**Changes:**

1. Override `supports_streaming()` to return `True`
2. Implement `generate_content_stream()` async generator
3. Handle both Gemini and Claude models (via VertexAI Anthropic API)

**Gemini Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    last_chunk = None
    for chunk in response:
        last_chunk = chunk
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # The final streamed chunk carries token counts in usage_metadata
    if last_chunk is not None and last_chunk.usage_metadata:
        yield LlmChunk(
            text="",
            in_token=last_chunk.usage_metadata.prompt_token_count,
            out_token=last_chunk.usage_metadata.candidates_token_count,
        )
```

**Claude (via VertexAI Anthropic) Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
        # Token counts come from the final message once the stream completes
        final = stream.get_final_message()
        yield LlmChunk(text="", in_token=final.usage.input_tokens,
                       out_token=final.usage.output_tokens)
```

#### Testing

- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work

---

### Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

#### Provider Implementation Status

Each provider must either:

1. **Full Streaming Support**: Implement `generate_content_stream()`
2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly
   (return single response with `end_of_stream=true`)

| Provider | Package | Streaming Support |
|----------|---------|-------------------|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |

#### Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```

---

### Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the
Agent API is already multi-message by nature (thought → action → observation
→ repeat → final answer).

#### Current Agent Schema

```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```

#### Proposed Agent Schema Changes

**Request Changes:**

```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

The agent produces multiple types of output during its reasoning cycle:

- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors

Since `chunk_type` identifies what kind of content is being sent, the separate
`answer`, `error`, `thought`, and `observation` fields can be collapsed into
a single `content` field:

```python
class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```

**Field Semantics:**

- `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message

- `content`: The actual streamed content, interpreted based on `chunk_type`

- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage

- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream

#### Agent Streaming Behavior

When `streaming=true`:

1. **Thought streaming**:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. **Action notification**:
   - Single chunk with `chunk_type="action"`, `end_of_message=true`
3. **Observation**:
   - Chunk(s) with `chunk_type="observation"`, final has `end_of_message=true`
4. **Repeat** steps 1-3 as the agent reasons
5. **Final answer**:
   - `chunk_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`

**Example Stream Sequence:**

```
{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```

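A minimal sketch of how a client might consume such a sequence, assuming
chunks arrive as dicts with the fields above (transport and deserialization
are out of scope here):

```python
def consume_agent_stream(chunks):
    buffer = []
    for chunk in chunks:
        buffer.append(chunk["content"])
        if chunk["end_of_message"]:
            # One complete thought/action/observation/answer
            print(f"[{chunk['chunk_type']}] {''.join(buffer)}")
            buffer = []
        if chunk["end_of_dialog"]:
            break
```
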
When `streaming=false`:

- Current behavior preserved
- Single response with complete answer
- `end_of_message=true`, `end_of_dialog=true`

#### Gateway and Python API

- Gateway: New SSE/WebSocket endpoint for agent streaming
- Python API: New `agent_stream()` async generator method (sketched below)

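A hypothetical sketch of `agent_stream()` in use; the method name comes from
this spec, while the client object and attribute access are assumptions:

```python
async def ask(agent, question):
    # agent: a client exposing the proposed agent_stream() async generator
    async for chunk in agent.agent_stream(question=question):
        if chunk.chunk_type == "answer":
            # Print answer tokens as they arrive
            print(chunk.content, end="", flush=True)
        if chunk.end_of_dialog:
            break
```
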
---

## Security Considerations

- **No new attack surface**: Streaming uses the same authentication/authorization
- **Rate limiting**: Apply per-token or per-chunk rate limits if needed
- **Connection handling**: Properly terminate streams on client disconnect
- **Timeout management**: Streaming requests need appropriate timeout handling

## Performance Considerations

- **Memory**: Streaming reduces peak memory usage (no full response buffering)
- **Latency**: Time-to-first-token significantly reduced
- **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
- **Pulsar throughput**: Tradeoff between multiple small messages and a single
  large message

## Testing Strategy

### Unit Tests

- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic

### Integration Tests

- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods

### End-to-End Tests

- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads

### Backward Compatibility Tests

- Existing clients work without modification
- Non-streaming requests behave identically

## Migration Plan

### Phase 1: Infrastructure

- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates

### Phase 2: VertexAI

- Deploy VertexAI streaming implementation
- Validate with test workloads

### Phase 3: All Providers

- Roll out provider updates incrementally
- Monitor for issues

### Phase 4: Agent API

- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation

## Timeline

| Phase | Description | Dependencies |
|-------|-------------|--------------|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |

## Design Decisions

The following questions were resolved during specification:

1. **Token Counts in Streaming**: Token counts are deltas, not running totals.
   Consumers can sum them if needed (see the sketch after this list). This
   matches how most providers report usage and simplifies the implementation.

2. **Error Handling in Streams**: If an error occurs, the `error` field is
   populated and no other fields are needed. An error is always the final
   communication; no subsequent messages are permitted or expected after
   an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams,
   `chunk_type="error"` with `end_of_dialog=true`.

3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient,
   so message-level retry is not needed. If a client loses track of the stream
   or disconnects, it must retry the full request from scratch.

4. **Prompt Service Streaming**: Streaming is only supported for text (`text`)
   responses, not structured (`object`) responses. The prompt service knows at
   the outset whether the output will be JSON or text based on the prompt
   template. If a streaming request is made for a JSON-output prompt, the
   service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error

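A sketch of the delta accumulation mentioned in decision 1, assuming chunks
carry the optional `in_token`/`out_token` fields:

```python
def total_usage(chunks):
    # Token counts are per-chunk deltas; intermediate chunks carry None
    in_total = sum(c.in_token or 0 for c in chunks)
    out_total = sum(c.out_token or 0 for c in chunks)
    return in_total, out_total
```
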
## Open Questions

None at this time.

## References

- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`