Feature/streaming llm phase 1 (#566)

* Tidy up duplicate tech specs in the docs directory
* Streaming LLM text-completion service tech spec
* text-completion and prompt interfaces
* Streaming change applied to all LLMs; so far tested with VertexAI
* Skip Pinecone unit tests; an upstream module issue was affecting things, and tests are passing again
* Added agent streaming; not working and has broken tests

This commit is contained in: parent 943a9d83b0, commit 310a2deb06. 44 changed files with 2684 additions and 937 deletions.

docs/tech-specs/streaming-llm-responses.md (new file, 570 lines)

# Streaming LLM Responses Technical Specification

## Overview

This specification describes the implementation of streaming support for LLM
responses in TrustGraph. Streaming enables real-time delivery of generated
tokens as they are produced by the LLM, rather than waiting for complete
response generation.

This implementation supports the following use cases:

1. **Real-time User Interfaces**: Stream tokens to the UI as they are generated,
   providing immediate visual feedback
2. **Reduced Time-to-First-Token**: Users see output beginning immediately
   rather than waiting for full generation
3. **Long Response Handling**: Handle very long outputs that might otherwise
   time out or exceed memory limits
4. **Interactive Applications**: Enable responsive chat and agent interfaces

## Goals

- **Backward Compatibility**: Existing non-streaming clients continue to work
  without modification
- **Consistent API Design**: Streaming and non-streaming use the same schema
  patterns with minimal divergence
- **Provider Flexibility**: Support streaming where available, graceful
  fallback where not
- **Phased Rollout**: Incremental implementation to reduce risk
- **End-to-End Support**: Streaming from LLM provider through to client
  applications via Pulsar, Gateway API, and Python API

## Background

### Current Architecture

The current LLM text completion flow operates as follows:

1. Client sends `TextCompletionRequest` with `system` and `prompt` fields
2. LLM service processes the request and waits for complete generation
3. Single `TextCompletionResponse` returned with complete `response` string

Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
```

### Current Limitations

- **Latency**: Users must wait for complete generation before seeing any output
- **Timeout Risk**: Long generations may exceed client timeout thresholds
- **Poor UX**: No feedback during generation creates a perception of slowness
- **Resource Usage**: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response
delivery while maintaining full backward compatibility.

## Technical Design

### Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs,
and CLI tools.

#### Schema Changes

##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)

**Request Changes:**

```python
class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
```

- `streaming`: When `true`, requests streaming response delivery
- Default: `false` (existing behavior preserved)

**Response Changes:**

```python
class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

- `end_of_stream`: When `true`, indicates this is the final (or only) response
- For non-streaming requests: Single response with `end_of_stream=true`
- For streaming requests: Multiple responses, all with `end_of_stream=false`
  except the final one

##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)

The prompt service wraps text completion, so it mirrors the same pattern:

**Request Changes:**

```python
class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

```python
class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
```

#### Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

**REST API Updates:**

- `POST /api/v1/text-completion`: Accept `streaming` parameter in request body
- Response behavior depends on the streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages

**Response Format (Streaming):**

Each streamed chunk follows the same schema structure:

```json
{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}
```

Final chunk:

```json
{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
```

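For illustration, a client might consume the SSE variant as follows. This is a
minimal sketch, not part of the API contract: it assumes chunks are delivered
as SSE `data:` lines containing the JSON objects above (the exact framing is
not fixed by this spec), and it uses `httpx` as the HTTP client.

```python
import json
import httpx

async def stream_completion(base_url, system, prompt):
    # Yields response text fragments as they arrive over SSE
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            f"{base_url}/api/v1/text-completion",
            json={"system": system, "prompt": prompt, "streaming": True},
        ) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data:"):
                    continue  # Skip SSE comments and keep-alives
                chunk = json.loads(line[len("data:"):])
                yield chunk["response"]
                if chunk.get("end_of_stream"):
                    break
```

A WebSocket client would follow the same chunk semantics over a different
transport.
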
#### Python API Changes

The Python client API must support both streaming and non-streaming modes
while maintaining backward compatibility.

**LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):

```python
class LlmClient(BaseClient):

    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
```

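A hypothetical usage sketch of the proposed streaming method, assuming
`request_stream()` yields plain text chunks:

```python
async def main(llm):
    # llm: an LlmClient instance (construction details omitted)
    async for chunk in llm.request_stream(
        system="You are a helpful assistant.",
        prompt="Explain streaming LLM responses in one paragraph.",
    ):
        print(chunk, end="", flush=True)
    print()

# Run with: asyncio.run(main(llm))
```
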
**PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):

A similar pattern applies, with a `streaming` parameter and an async generator
variant.

#### CLI Tool Changes

**tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):

```
tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
```

- Streaming enabled by default for better interactive UX
- `--no-streaming` flag disables streaming
- When streaming: Output tokens to stdout as they arrive
- When not streaming: Wait for the complete response, then output

**tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):

```
tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
```

Same pattern as `tg-invoke-llm`. A sketch of the streaming output loop follows.

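A minimal sketch of that output loop, assuming the `request_stream()` and
`request()` client API proposed above (argument parsing and client
construction omitted):

```python
import sys

async def run(llm, system, prompt, streaming=True):
    if streaming:
        # Write tokens to stdout as they arrive
        async for chunk in llm.request_stream(system, prompt):
            sys.stdout.write(chunk)
            sys.stdout.flush()
        sys.stdout.write("\n")
    else:
        # Wait for the complete response, then output it
        print(llm.request(system, prompt))
```
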
#### LLM Service Base Class Changes

**LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):

```python
class LlmService(FlowProcessor):

    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            # Final response carries token counts and end_of_stream=true
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
```

---

### Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the
infrastructure and enable end-to-end testing.

#### VertexAI Implementation

**Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`

**Changes:**

1. Override `supports_streaming()` to return `True`
2. Implement `generate_content_stream()` async generator
3. Handle both Gemini and Claude models (via VertexAI Anthropic API)

**Gemini Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    last_chunk = None
    for chunk in response:
        last_chunk = chunk
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # The final streamed chunk carries token counts in usage_metadata
    if last_chunk is not None and last_chunk.usage_metadata:
        yield LlmChunk(
            text="",
            in_token=last_chunk.usage_metadata.prompt_token_count,
            out_token=last_chunk.usage_metadata.candidates_token_count,
        )
```

**Claude (via VertexAI Anthropic) Streaming:**

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
        # Token counts come from the final message once the stream completes
        final = stream.get_final_message()
        yield LlmChunk(text="", in_token=final.usage.input_tokens,
                       out_token=final.usage.output_tokens)
```

#### Testing

- Unit tests for streaming response assembly
- Integration tests with VertexAI (Gemini and Claude)
- End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
- Backward compatibility tests: Non-streaming requests still work

---

### Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

#### Provider Implementation Status

Each provider must either:

1. **Full Streaming Support**: Implement `generate_content_stream()`
2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly
   (return single response with `end_of_stream=true`)

| Provider | Package | Streaming Support |
|----------|---------|-------------------|
| OpenAI | trustgraph-flow | Full (native streaming API) |
| Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
| Ollama | trustgraph-flow | Full (native streaming API) |
| Cohere | trustgraph-flow | Full (native streaming API) |
| Mistral | trustgraph-flow | Full (native streaming API) |
| Azure OpenAI | trustgraph-flow | Full (native streaming API) |
| Google AI Studio | trustgraph-flow | Full (native streaming API) |
| VertexAI | trustgraph-vertexai | Full (Phase 2) |
| Bedrock | trustgraph-bedrock | Full (native streaming API) |
| LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
| LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
| vLLM | trustgraph-flow | Full (OpenAI-compatible) |
| TGI | trustgraph-flow | TBD |
| Azure | trustgraph-flow | TBD |

#### Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

```python
async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
```

---

### Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the
Agent API is already multi-message by nature (thought → action → observation
→ repeat → final answer).

#### Current Agent Schema

```python
class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
```

#### Proposed Agent Schema Changes

**Request Changes:**

```python
class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
```

**Response Changes:**

The agent produces multiple types of output during its reasoning cycle:

- Thoughts (reasoning)
- Actions (tool calls)
- Observations (tool results)
- Answer (final response)
- Errors

Since `chunk_type` identifies what kind of content is being sent, the separate
`answer`, `error`, `thought`, and `observation` fields can be collapsed into
a single `content` field:

```python
class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
```

**Field Semantics:**

- `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message

- `content`: The actual streamed content, interpreted based on `chunk_type`

- `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage

- `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream

#### Agent Streaming Behavior

When `streaming=true`:

1. **Thought streaming**:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
2. **Action notification**:
   - Single chunk with `chunk_type="action"`, `end_of_message=true`
3. **Observation**:
   - Chunk(s) with `chunk_type="observation"`, final has `end_of_message=true`
4. **Repeat** steps 1-3 as the agent reasons
5. **Final answer**:
   - `chunk_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`

**Example Stream Sequence:**

```
{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
```

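A minimal sketch of how a client might consume such a sequence, assuming
chunks arrive as dicts with the fields above (transport and deserialization
are out of scope here):

```python
def consume_agent_stream(chunks):
    buffer = []
    for chunk in chunks:
        buffer.append(chunk["content"])
        if chunk["end_of_message"]:
            # One complete thought/action/observation/answer
            print(f"[{chunk['chunk_type']}] {''.join(buffer)}")
            buffer = []
        if chunk["end_of_dialog"]:
            break
```
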
When `streaming=false`:

- Current behavior preserved
- Single response with complete answer
- `end_of_message=true`, `end_of_dialog=true`

#### Gateway and Python API

- Gateway: New SSE/WebSocket endpoint for agent streaming
- Python API: New `agent_stream()` async generator method (sketched below)

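A hypothetical sketch of `agent_stream()` in use; the method name comes from
this spec, while the client object and attribute access are assumptions:

```python
async def ask(agent, question):
    # agent: a client exposing the proposed agent_stream() async generator
    async for chunk in agent.agent_stream(question=question):
        if chunk.chunk_type == "answer":
            # Print answer tokens as they arrive
            print(chunk.content, end="", flush=True)
        if chunk.end_of_dialog:
            break
```
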
---

## Security Considerations

- **No new attack surface**: Streaming uses the same authentication/authorization
- **Rate limiting**: Apply per-token or per-chunk rate limits if needed
- **Connection handling**: Properly terminate streams on client disconnect
- **Timeout management**: Streaming requests need appropriate timeout handling

## Performance Considerations

- **Memory**: Streaming reduces peak memory usage (no full response buffering)
- **Latency**: Time-to-first-token significantly reduced
- **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
- **Pulsar throughput**: Tradeoff between multiple small messages and a single
  large message

## Testing Strategy

### Unit Tests

- Schema serialization/deserialization with new fields
- Backward compatibility (missing fields use defaults)
- Chunk assembly logic

### Integration Tests

- Each LLM provider's streaming implementation
- Gateway API streaming endpoints
- Python client streaming methods

### End-to-End Tests

- CLI tool streaming output
- Full flow: Client → Gateway → Pulsar → LLM → back
- Mixed streaming/non-streaming workloads

### Backward Compatibility Tests

- Existing clients work without modification
- Non-streaming requests behave identically

## Migration Plan

### Phase 1: Infrastructure

- Deploy schema changes (backward compatible)
- Deploy Gateway API updates
- Deploy Python API updates
- Release CLI tool updates

### Phase 2: VertexAI

- Deploy VertexAI streaming implementation
- Validate with test workloads

### Phase 3: All Providers

- Roll out provider updates incrementally
- Monitor for issues

### Phase 4: Agent API

- Deploy agent schema changes
- Deploy agent streaming implementation
- Update documentation

## Timeline

| Phase | Description | Dependencies |
|-------|-------------|--------------|
| Phase 1 | Infrastructure | None |
| Phase 2 | VertexAI PoC | Phase 1 |
| Phase 3 | All Providers | Phase 2 |
| Phase 4 | Agent API | Phase 3 |

## Design Decisions

The following questions were resolved during specification:

1. **Token Counts in Streaming**: Token counts are deltas, not running totals.
   Consumers can sum them if needed (see the sketch after this list). This
   matches how most providers report usage and simplifies the implementation.

2. **Error Handling in Streams**: If an error occurs, the `error` field is
   populated and no other fields are needed. An error is always the final
   communication; no subsequent messages are permitted or expected after
   an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams,
   `chunk_type="error"` with `end_of_dialog=true`.

3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient,
   so message-level retry is not needed. If a client loses track of the stream
   or disconnects, it must retry the full request from scratch.

4. **Prompt Service Streaming**: Streaming is only supported for text (`text`)
   responses, not structured (`object`) responses. The prompt service knows at
   the outset whether the output will be JSON or text based on the prompt
   template. If a streaming request is made for a JSON-output prompt, the
   service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error

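A sketch of the delta accumulation mentioned in decision 1, assuming chunks
carry the optional `in_token`/`out_token` fields:

```python
def total_usage(chunks):
    # Token counts are per-chunk deltas; intermediate chunks carry None
    in_total = sum(c.in_token or 0 for c in chunks)
    out_total = sum(c.out_token or 0 for c in chunks)
    return in_total, out_total
```
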
## Open Questions

None at this time.

## References

- Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
- Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
- Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
- LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
- VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
- Gateway API: `trustgraph-base/trustgraph/api/`
- CLI tools: `trustgraph-cli/trustgraph/cli/`