
Streaming LLM Responses Technical Specification

Overview

This specification describes the implementation of streaming support for LLM responses in TrustGraph. Streaming enables real-time delivery of generated tokens as they are produced by the LLM, rather than waiting for complete response generation.

This implementation supports the following use cases:

  1. Real-time User Interfaces: Stream tokens to UI as they are generated, providing immediate visual feedback
  2. Reduced Time-to-First-Token: Users see output beginning immediately rather than waiting for full generation
  3. Long Response Handling: Handle very long outputs that might otherwise time out or exceed memory limits
  4. Interactive Applications: Enable responsive chat and agent interfaces

Goals

  • Backward Compatibility: Existing non-streaming clients continue to work without modification
  • Consistent API Design: Streaming and non-streaming use the same schema patterns with minimal divergence
  • Provider Flexibility: Support streaming where available, graceful fallback where not
  • Phased Rollout: Incremental implementation to reduce risk
  • End-to-End Support: Streaming from LLM provider through to client applications via Pulsar, Gateway API, and Python API

Background

Current Architecture

The current LLM text completion flow operates as follows:

  1. Client sends TextCompletionRequest with system and prompt fields
  2. LLM service processes the request and waits for complete generation
  3. Single TextCompletionResponse returned with complete response string

Current schema (trustgraph-base/trustgraph/schema/services/llm.py):

class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()

Current Limitations

  • Latency: Users must wait for complete generation before seeing any output
  • Timeout Risk: Long generations may exceed client timeout thresholds
  • Poor UX: No feedback during generation creates perception of slowness
  • Resource Usage: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response delivery while maintaining full backward compatibility.

Technical Design

Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs, and CLI tools.

Schema Changes

LLM Schema (trustgraph-base/trustgraph/schema/services/llm.py)

Request Changes:

class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
  • streaming: When true, requests streaming response delivery
  • Default: false (existing behavior preserved)

Response Changes:

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
  • end_of_stream: When true, indicates this is the final (or only) response
  • For non-streaming requests: Single response with end_of_stream=true
  • For streaming requests: Multiple responses, all with end_of_stream=false except the final one
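
For illustration, a consumer written against this schema can treat both modes uniformly by accumulating messages until end_of_stream is set. This is a minimal sketch; it assumes the message source yields TextCompletionResponse objects in order:

def assemble(responses):
    """Collect TextCompletionResponse messages into (text, in_tokens, out_tokens)."""
    parts, in_tokens, out_tokens = [], 0, 0
    for r in responses:
        if r.error:
            raise RuntimeError(r.error)    # an error is always the final message
        parts.append(r.response or "")
        in_tokens += r.in_token or 0       # token counts are per-message deltas
        out_tokens += r.out_token or 0
        if r.end_of_stream:
            break
    return "".join(parts), in_tokens, out_tokens
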
Prompt Schema (trustgraph-base/trustgraph/schema/services/prompt.py)

The prompt service wraps text completion, so it mirrors the same pattern:

Request Changes:

class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false

Response Changes:

class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message

Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

REST API Updates:

  • POST /api/v1/text-completion: Accept streaming parameter in request body
  • Response behavior depends on streaming flag:
    • streaming=false: Single JSON response (current behavior)
    • streaming=true: Server-Sent Events (SSE) stream or WebSocket messages

Response Format (Streaming):

Each streamed chunk follows the same schema structure:

{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}

Final chunk:

{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
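
For illustration, a client could consume the streaming variant over SSE roughly as follows. This is a sketch, assuming each chunk arrives as a "data:" line of JSON on the endpoint above; the HTTP library choice and error handling are illustrative:

import json
import requests

def stream_completion(base_url, system, prompt):
    """Yield response text chunks from the streaming text-completion endpoint."""
    body = {"system": system, "prompt": prompt, "streaming": True}
    with requests.post(f"{base_url}/api/v1/text-completion", json=body, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data:"):
                continue                      # skip keep-alives and blank lines
            chunk = json.loads(line[len(b"data:"):])
            yield chunk.get("response", "")
            if chunk.get("end_of_stream"):
                break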

Python API Changes

The Python client API must support both streaming and non-streaming modes while maintaining backward compatibility.

LlmClient Updates (trustgraph-base/trustgraph/clients/llm_client.py):

class LlmClient(BaseClient):
    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method

PromptClient Updates (trustgraph-base/trustgraph/base/prompt_client.py):

Similar pattern with streaming parameter and async generator variant.
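
For illustration, client code using these methods might look like the sketch below; connection parameters are elided, and the streaming call assumes the async generator form proposed above:

import asyncio

from trustgraph.clients.llm_client import LlmClient

llm = LlmClient(...)  # connection parameters elided

# Non-streaming: returns the complete response string (existing behavior)
text = llm.request(system="You are a helpful assistant.", prompt="Summarise the report.")

# Streaming: consume chunks as they arrive
async def stream_it():
    async for chunk in llm.request_stream(
        system="You are a helpful assistant.",
        prompt="Summarise the report.",
    ):
        print(chunk, end="", flush=True)

asyncio.run(stream_it())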

CLI Tool Changes

tg-invoke-llm (trustgraph-cli/trustgraph/cli/invoke_llm.py):

tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
  • Streaming enabled by default for better interactive UX
  • --no-streaming flag disables streaming
  • When streaming: Output tokens to stdout as they arrive
  • When not streaming: Wait for complete response, then output

tg-invoke-prompt (trustgraph-cli/trustgraph/cli/invoke_prompt.py):

tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]

Same pattern as tg-invoke-llm.

LLM Service Base Class Changes

LlmService (trustgraph-base/trustgraph/base/llm_service.py):

class LlmService(FlowProcessor):
    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
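
To illustrate the contract, a provider that opts into streaming overrides both hooks. The sketch below uses a placeholder provider_stream call standing in for the real SDK; Phases 2 and 3 cover the actual providers:

class ExampleStreamingLlm(LlmService):

    def supports_streaming(self):
        return True

    async def generate_content_stream(self, system, prompt, model, temperature):
        # provider_stream is a placeholder for the provider SDK's streaming call
        async for piece in self.provider_stream(system, prompt, model, temperature):
            yield LlmChunk(text=piece)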

Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the infrastructure and enable end-to-end testing.

VertexAI Implementation

Module: trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py

Changes:

  1. Override supports_streaming() to return True
  2. Implement generate_content_stream() async generator
  3. Handle both Gemini and Claude models (via VertexAI Anthropic API)

Gemini Streaming:

async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
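
The closing chunk could be derived from the usage metadata reported on the stream. A sketch, assuming the last streamed chunk exposes usage_metadata with prompt_token_count and candidates_token_count (field availability varies by SDK version):

async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content([system, prompt], stream=True)
    usage = None
    for chunk in response:
        if chunk.text:
            yield LlmChunk(text=chunk.text)
        # Remember the latest usage metadata; the final chunk carries the totals
        usage = getattr(chunk, "usage_metadata", None) or usage
    if usage is not None:
        # Closing chunk: no text, token counts populated
        yield LlmChunk(
            text="",
            in_token=usage.prompt_token_count,
            out_token=usage.candidates_token_count,
        )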

Claude (via VertexAI Anthropic) Streaming:

async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()

Testing

  • Unit tests for streaming response assembly
  • Integration tests with VertexAI (Gemini and Claude)
  • End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
  • Backward compatibility tests: Non-streaming requests still work

Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

Provider Implementation Status

Each provider must either:

  1. Full Streaming Support: Implement generate_content_stream()
  2. Compatibility Mode: Handle the end_of_stream flag correctly (return a single response with end_of_stream=true)

Provider           Package               Streaming Support
OpenAI             trustgraph-flow       Full (native streaming API)
Claude/Anthropic   trustgraph-flow       Full (native streaming API)
Ollama             trustgraph-flow       Full (native streaming API)
Cohere             trustgraph-flow       Full (native streaming API)
Mistral            trustgraph-flow       Full (native streaming API)
Azure OpenAI       trustgraph-flow       Full (native streaming API)
Google AI Studio   trustgraph-flow       Full (native streaming API)
VertexAI           trustgraph-vertexai   Full (Phase 2)
Bedrock            trustgraph-bedrock    Full (native streaming API)
LM Studio          trustgraph-flow       Full (OpenAI-compatible)
LlamaFile          trustgraph-flow       Full (OpenAI-compatible)
vLLM               trustgraph-flow       Full (OpenAI-compatible)
TGI                trustgraph-flow       TBD
Azure              trustgraph-flow       TBD

Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
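
Token counts for the closing end_of_stream message can typically be obtained by asking the API to include usage in the stream. A sketch extending the pattern above, assuming the stream_options mechanism of the OpenAI client:

async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True,
        stream_options={"include_usage": True},   # final chunk carries usage, no choices
    )
    async for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
        if chunk.usage:
            # Closing chunk with token counts for the end_of_stream message
            yield LlmChunk(
                text="",
                in_token=chunk.usage.prompt_tokens,
                out_token=chunk.usage.completion_tokens,
            )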

Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the Agent API is already multi-message by nature (thought → action → observation → repeat → final answer).

Current Agent Schema

class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()

Proposed Agent Schema Changes

Request Changes:

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false

Response Changes:

The agent produces multiple types of output during its reasoning cycle:

  • Thoughts (reasoning)
  • Actions (tool calls)
  • Observations (tool results)
  • Answer (final response)
  • Errors

Since chunk_type identifies what kind of content is being sent, the separate answer, error, thought, and observation fields can be collapsed into a single content field:

class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete

Field Semantics:

  • chunk_type: Indicates what type of content is in the content field

    • "thought": Agent reasoning/thinking
    • "action": Tool/action being invoked
    • "observation": Result from tool execution
    • "answer": Final answer to the user's question
    • "error": Error message
  • content: The actual streamed content, interpreted based on chunk_type

  • end_of_message: When true, the current chunk type is complete

    • Example: All tokens for the current thought have been sent
    • Allows clients to know when to move to the next stage
  • end_of_dialog: When true, the entire agent interaction is complete

    • This is the final message in the stream

Agent Streaming Behavior

When streaming=true:

  1. Thought streaming:
    • Multiple chunks with chunk_type="thought", end_of_message=false
    • Final thought chunk has end_of_message=true
  2. Action notification:
    • Single chunk with chunk_type="action", end_of_message=true
  3. Observation:
    • Chunk(s) with chunk_type="observation", final has end_of_message=true
  4. Repeat steps 1-3 as the agent reasons
  5. Final answer:
    • chunk_type="answer" with the final response in content
    • Last chunk has end_of_message=true, end_of_dialog=true

Example Stream Sequence:

{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}

When streaming=false:

  • Current behavior preserved
  • Single response with complete answer
  • end_of_message=true, end_of_dialog=true

Gateway and Python API

  • Gateway: New SSE/WebSocket endpoint for agent streaming
  • Python API: New agent_stream() async generator method
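
For illustration, a client might consume the proposed agent_stream() generator by dispatching on chunk_type, along the lines of the sketch below (client construction and exact field access are assumptions):

async def run_agent(client, question):
    thought = []
    async for chunk in client.agent_stream(question=question):
        if chunk.chunk_type == "error":
            raise RuntimeError(chunk.content)          # an error ends the dialog
        if chunk.chunk_type == "thought":
            thought.append(chunk.content)
            if chunk.end_of_message:
                print("Thought:", "".join(thought))
                thought = []
        elif chunk.chunk_type in ("action", "observation"):
            print(chunk.chunk_type.capitalize() + ":", chunk.content)
        elif chunk.chunk_type == "answer":
            print(chunk.content, end="", flush=True)
        if chunk.end_of_dialog:
            break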

Security Considerations

  • No new attack surface: Streaming uses same authentication/authorization
  • Rate limiting: Apply per-token or per-chunk rate limits if needed
  • Connection handling: Properly terminate streams on client disconnect
  • Timeout management: Streaming requests need appropriate timeout handling

Performance Considerations

  • Memory: Streaming reduces peak memory usage (no full response buffering)
  • Latency: Time-to-first-token significantly reduced
  • Connection overhead: SSE/WebSocket connections have keep-alive overhead
  • Pulsar throughput: Multiple small messages vs. single large message tradeoff

Testing Strategy

Unit Tests

  • Schema serialization/deserialization with new fields
  • Backward compatibility (missing fields use defaults)
  • Chunk assembly logic

Integration Tests

  • Each LLM provider's streaming implementation
  • Gateway API streaming endpoints
  • Python client streaming methods

End-to-End Tests

  • CLI tool streaming output
  • Full flow: Client → Gateway → Pulsar → LLM → back
  • Mixed streaming/non-streaming workloads

Backward Compatibility Tests

  • Existing clients work without modification
  • Non-streaming requests behave identically

Migration Plan

Phase 1: Infrastructure

  • Deploy schema changes (backward compatible)
  • Deploy Gateway API updates
  • Deploy Python API updates
  • Release CLI tool updates

Phase 2: VertexAI

  • Deploy VertexAI streaming implementation
  • Validate with test workloads

Phase 3: All Providers

  • Roll out provider updates incrementally
  • Monitor for issues

Phase 4: Agent API

  • Deploy agent schema changes
  • Deploy agent streaming implementation
  • Update documentation

Timeline

Phase     Description      Dependencies
Phase 1   Infrastructure   None
Phase 2   VertexAI PoC     Phase 1
Phase 3   All Providers    Phase 2
Phase 4   Agent API        Phase 3

Design Decisions

The following questions were resolved during specification:

  1. Token Counts in Streaming: Token counts are deltas, not running totals. Consumers can sum them if needed. This matches how most providers report usage and simplifies the implementation.

  2. Error Handling in Streams: If an error occurs, the error field is populated and no other fields are needed. An error is always the final communication - no subsequent messages are permitted or expected after an error. For LLM/Prompt streams, end_of_stream=true. For Agent streams, chunk_type="error" with end_of_dialog=true.

  3. Partial Response Recovery: The messaging protocol (Pulsar) is resilient, so message-level retry is not needed. If a client loses track of the stream or disconnects, it must retry the full request from scratch.

  4. Prompt Service Streaming: Streaming is only supported for plain-text responses (the text field), not structured responses (the object field). The prompt service knows at the outset, from the prompt template, whether the output will be JSON or text. If a streaming request is made for a JSON-output prompt, the service should either:

    • Return the complete JSON in a single response with end_of_stream=true, or
    • Reject the streaming request with an error

Open Questions

None at this time.

References

  • Current LLM schema: trustgraph-base/trustgraph/schema/services/llm.py
  • Current prompt schema: trustgraph-base/trustgraph/schema/services/prompt.py
  • Current agent schema: trustgraph-base/trustgraph/schema/services/agent.py
  • LLM service base: trustgraph-base/trustgraph/base/llm_service.py
  • VertexAI provider: trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py
  • Gateway API: trustgraph-base/trustgraph/api/
  • CLI tools: trustgraph-cli/trustgraph/cli/