
Streaming LLM Responses Technical Specification

Overview

This specification describes the implementation of streaming support for LLM responses in TrustGraph. Streaming enables real-time delivery of generated tokens as they are produced by the LLM, rather than waiting for complete response generation.

This implementation supports the following use cases:

  1. Real-time User Interfaces: Stream tokens to UI as they are generated, providing immediate visual feedback
  2. Reduced Time-to-First-Token: Users see output beginning immediately rather than waiting for full generation
  3. Long Response Handling: Handle very long outputs that might otherwise time out or exceed memory limits
  4. Interactive Applications: Enable responsive chat and agent interfaces

Goals

  • Backward Compatibility: Existing non-streaming clients continue to work without modification
  • Consistent API Design: Streaming and non-streaming use the same schema patterns with minimal divergence
  • Provider Flexibility: Support streaming where available, graceful fallback where not
  • Phased Rollout: Incremental implementation to reduce risk
  • End-to-End Support: Streaming from LLM provider through to client applications via Pulsar, Gateway API, and Python API

Background

Current Architecture

The current LLM text completion flow operates as follows:

  1. Client sends TextCompletionRequest with system and prompt fields
  2. LLM service processes the request and waits for complete generation
  3. Single TextCompletionResponse returned with complete response string

Current schema (trustgraph-base/trustgraph/schema/services/llm.py):

class TextCompletionRequest(Record):
    system = String()
    prompt = String()

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()

Current Limitations

  • Latency: Users must wait for complete generation before seeing any output
  • Timeout Risk: Long generations may exceed client timeout thresholds
  • Poor UX: No feedback during generation creates perception of slowness
  • Resource Usage: Full responses must be buffered in memory

This specification addresses these limitations by enabling incremental response delivery while maintaining full backward compatibility.

Technical Design

Phase 1: Infrastructure

Phase 1 establishes the foundation for streaming by modifying schemas, APIs, and CLI tools.

Schema Changes

LLM Schema (trustgraph-base/trustgraph/schema/services/llm.py)

Request Changes:

class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
  • streaming: When true, requests streaming response delivery
  • Default: false (existing behavior preserved)

Response Changes:

class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
  • end_of_stream: When true, indicates this is the final (or only) response
  • For non-streaming requests: Single response with end_of_stream=true
  • For streaming requests: Multiple responses, all with end_of_stream=false except the final one
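
For illustration, a consumer written against this schema can treat both modes uniformly by accumulating messages until end_of_stream is set. This is a minimal sketch; it assumes the message source yields TextCompletionResponse objects in order:

def assemble(responses):
    """Collect TextCompletionResponse messages into (text, in_tokens, out_tokens)."""
    parts, in_tokens, out_tokens = [], 0, 0
    for r in responses:
        if r.error:
            raise RuntimeError(r.error)    # an error is always the final message
        parts.append(r.response or "")
        in_tokens += r.in_token or 0       # token counts are per-message deltas
        out_tokens += r.out_token or 0
        if r.end_of_stream:
            break
    return "".join(parts), in_tokens, out_tokens
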
Prompt Schema (trustgraph-base/trustgraph/schema/services/prompt.py)

The prompt service wraps text completion, so it mirrors the same pattern:

Request Changes:

class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false

Response Changes:

class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message

Gateway API Changes

The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.

REST API Updates:

  • POST /api/v1/text-completion: Accept streaming parameter in request body
  • Response behavior depends on streaming flag:
    • streaming=false: Single JSON response (current behavior)
    • streaming=true: Server-Sent Events (SSE) stream or WebSocket messages

Response Format (Streaming):

Each streamed chunk follows the same schema structure:

{
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
}

Final chunk:

{
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
}
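
For illustration, a client could consume the streaming variant over SSE roughly as follows. This is a sketch, assuming each chunk arrives as a "data:" line of JSON on the endpoint above; the HTTP library choice and error handling are illustrative:

import json
import requests

def stream_completion(base_url, system, prompt):
    """Yield response text chunks from the streaming text-completion endpoint."""
    body = {"system": system, "prompt": prompt, "streaming": True}
    with requests.post(f"{base_url}/api/v1/text-completion", json=body, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data:"):
                continue                      # skip keep-alives and blank lines
            chunk = json.loads(line[len(b"data:"):])
            yield chunk.get("response", "")
            if chunk.get("end_of_stream"):
                break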

Python API Changes

The Python client API must support both streaming and non-streaming modes while maintaining backward compatibility.

LlmClient Updates (trustgraph-base/trustgraph/clients/llm_client.py):

class LlmClient(BaseClient):
    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False

    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method

PromptClient Updates (trustgraph-base/trustgraph/base/prompt_client.py):

Similar pattern with streaming parameter and async generator variant.
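
For illustration, client code using these methods might look like the sketch below; connection parameters are elided, and the streaming call assumes the async generator form proposed above:

import asyncio

from trustgraph.clients.llm_client import LlmClient

llm = LlmClient(...)  # connection parameters elided

# Non-streaming: returns the complete response string (existing behavior)
text = llm.request(system="You are a helpful assistant.", prompt="Summarise the report.")

# Streaming: consume chunks as they arrive
async def stream_it():
    async for chunk in llm.request_stream(
        system="You are a helpful assistant.",
        prompt="Summarise the report.",
    ):
        print(chunk, end="", flush=True)

asyncio.run(stream_it())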

CLI Tool Changes

tg-invoke-llm (trustgraph-cli/trustgraph/cli/invoke_llm.py):

tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
  • Streaming enabled by default for better interactive UX
  • --no-streaming flag disables streaming
  • When streaming: Output tokens to stdout as they arrive
  • When not streaming: Wait for complete response, then output

tg-invoke-prompt (trustgraph-cli/trustgraph/cli/invoke_prompt.py):

tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]

Same pattern as tg-invoke-llm.

LLM Service Base Class Changes

LlmService (trustgraph-base/trustgraph/base/llm_service.py):

class LlmService(FlowProcessor):
    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)

        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)

    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False

    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
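
To illustrate the contract, a provider that opts into streaming overrides both hooks. The sketch below uses a placeholder provider_stream call standing in for the real SDK; Phases 2 and 3 cover the actual providers:

class ExampleStreamingLlm(LlmService):

    def supports_streaming(self):
        return True

    async def generate_content_stream(self, system, prompt, model, temperature):
        # provider_stream is a placeholder for the provider SDK's streaming call
        async for piece in self.provider_stream(system, prompt, model, temperature):
            yield LlmChunk(text=piece)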

Phase 2: VertexAI Proof of Concept

Phase 2 implements streaming in a single provider (VertexAI) to validate the infrastructure and enable end-to-end testing.

VertexAI Implementation

Module: trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py

Changes:

  1. Override supports_streaming() to return True
  2. Implement generate_content_stream() async generator
  3. Handle both Gemini and Claude models (via VertexAI Anthropic API)

Gemini Streaming:

async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
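
The closing chunk could be derived from the usage metadata reported on the stream. A sketch, assuming the last streamed chunk exposes usage_metadata with prompt_token_count and candidates_token_count (field availability varies by SDK version):

async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content([system, prompt], stream=True)
    usage = None
    for chunk in response:
        if chunk.text:
            yield LlmChunk(text=chunk.text)
        # Remember the latest usage metadata; the final chunk carries the totals
        usage = getattr(chunk, "usage_metadata", None) or usage
    if usage is not None:
        # Closing chunk: no text, token counts populated
        yield LlmChunk(
            text="",
            in_token=usage.prompt_token_count,
            out_token=usage.candidates_token_count,
        )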

Claude (via VertexAI Anthropic) Streaming:

async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()

Testing

  • Unit tests for streaming response assembly
  • Integration tests with VertexAI (Gemini and Claude)
  • End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
  • Backward compatibility tests: Non-streaming requests still work

Phase 3: All LLM Providers

Phase 3 extends streaming support to all LLM providers in the system.

Provider Implementation Status

Each provider must either:

  1. Full Streaming Support: Implement generate_content_stream()
  2. Compatibility Mode: Handle the end_of_stream flag correctly (return a single response with end_of_stream=true)

Provider           Package               Streaming Support
OpenAI             trustgraph-flow       Full (native streaming API)
Claude/Anthropic   trustgraph-flow       Full (native streaming API)
Ollama             trustgraph-flow       Full (native streaming API)
Cohere             trustgraph-flow       Full (native streaming API)
Mistral            trustgraph-flow       Full (native streaming API)
Azure OpenAI       trustgraph-flow       Full (native streaming API)
Google AI Studio   trustgraph-flow       Full (native streaming API)
VertexAI           trustgraph-vertexai   Full (Phase 2)
Bedrock            trustgraph-bedrock    Full (native streaming API)
LM Studio          trustgraph-flow       Full (OpenAI-compatible)
LlamaFile          trustgraph-flow       Full (OpenAI-compatible)
vLLM               trustgraph-flow       Full (OpenAI-compatible)
TGI                trustgraph-flow       TBD
Azure              trustgraph-flow       TBD

Implementation Pattern

For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):

async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
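
Token counts for the closing end_of_stream message can typically be obtained by asking the API to include usage in the stream. A sketch extending the pattern above, assuming the stream_options mechanism of the OpenAI client:

async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True,
        stream_options={"include_usage": True},   # final chunk carries usage, no choices
    )
    async for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
        if chunk.usage:
            # Closing chunk with token counts for the end_of_stream message
            yield LlmChunk(
                text="",
                in_token=chunk.usage.prompt_tokens,
                out_token=chunk.usage.completion_tokens,
            )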

Phase 4: Agent API

Phase 4 extends streaming to the Agent API. This is more complex because the Agent API is already multi-message by nature (thought → action → observation → repeat → final answer).

Current Agent Schema

class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()

class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()

Proposed Agent Schema Changes

Request Changes:

class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false

Response Changes:

The agent produces multiple types of output during its reasoning cycle:

  • Thoughts (reasoning)
  • Actions (tool calls)
  • Observations (tool results)
  • Answer (final response)
  • Errors

Since chunk_type identifies what kind of content is being sent, the separate answer, error, thought, and observation fields can be collapsed into a single content field:

class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete

Field Semantics:

  • chunk_type: Indicates what type of content is in the content field

    • "thought": Agent reasoning/thinking
    • "action": Tool/action being invoked
    • "observation": Result from tool execution
    • "answer": Final answer to the user's question
    • "error": Error message
  • content: The actual streamed content, interpreted based on chunk_type

  • end_of_message: When true, the current chunk type is complete

    • Example: All tokens for the current thought have been sent
    • Allows clients to know when to move to the next stage
  • end_of_dialog: When true, the entire agent interaction is complete

    • This is the final message in the stream

Agent Streaming Behavior

When streaming=true:

  1. Thought streaming:
    • Multiple chunks with chunk_type="thought", end_of_message=false
    • Final thought chunk has end_of_message=true
  2. Action notification:
    • Single chunk with chunk_type="action", end_of_message=true
  3. Observation:
    • Chunk(s) with chunk_type="observation", final has end_of_message=true
  4. Repeat steps 1-3 as the agent reasons
  5. Final answer:
    • chunk_type="answer" with the final response in content
    • Last chunk has end_of_message=true, end_of_dialog=true

Example Stream Sequence:

{chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
{chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
{chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
{chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
{chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
{chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}

When streaming=false:

  • Current behavior preserved
  • Single response with complete answer
  • end_of_message=true, end_of_dialog=true

Gateway and Python API

  • Gateway: New SSE/WebSocket endpoint for agent streaming
  • Python API: New agent_stream() async generator method
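
For illustration, a client might consume the proposed agent_stream() generator by dispatching on chunk_type, along the lines of the sketch below (client construction and exact field access are assumptions):

async def run_agent(client, question):
    thought = []
    async for chunk in client.agent_stream(question=question):
        if chunk.chunk_type == "error":
            raise RuntimeError(chunk.content)          # an error ends the dialog
        if chunk.chunk_type == "thought":
            thought.append(chunk.content)
            if chunk.end_of_message:
                print("Thought:", "".join(thought))
                thought = []
        elif chunk.chunk_type in ("action", "observation"):
            print(chunk.chunk_type.capitalize() + ":", chunk.content)
        elif chunk.chunk_type == "answer":
            print(chunk.content, end="", flush=True)
        if chunk.end_of_dialog:
            break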

Security Considerations

  • No new attack surface: Streaming uses same authentication/authorization
  • Rate limiting: Apply per-token or per-chunk rate limits if needed
  • Connection handling: Properly terminate streams on client disconnect
  • Timeout management: Streaming requests need appropriate timeout handling

Performance Considerations

  • Memory: Streaming reduces peak memory usage (no full response buffering)
  • Latency: Time-to-first-token significantly reduced
  • Connection overhead: SSE/WebSocket connections have keep-alive overhead
  • Pulsar throughput: Multiple small messages vs. single large message tradeoff

Testing Strategy

Unit Tests

  • Schema serialization/deserialization with new fields
  • Backward compatibility (missing fields use defaults)
  • Chunk assembly logic

Integration Tests

  • Each LLM provider's streaming implementation
  • Gateway API streaming endpoints
  • Python client streaming methods

End-to-End Tests

  • CLI tool streaming output
  • Full flow: Client → Gateway → Pulsar → LLM → back
  • Mixed streaming/non-streaming workloads

Backward Compatibility Tests

  • Existing clients work without modification
  • Non-streaming requests behave identically

Migration Plan

Phase 1: Infrastructure

  • Deploy schema changes (backward compatible)
  • Deploy Gateway API updates
  • Deploy Python API updates
  • Release CLI tool updates

Phase 2: VertexAI

  • Deploy VertexAI streaming implementation
  • Validate with test workloads

Phase 3: All Providers

  • Roll out provider updates incrementally
  • Monitor for issues

Phase 4: Agent API

  • Deploy agent schema changes
  • Deploy agent streaming implementation
  • Update documentation

Timeline

Phase     Description      Dependencies
Phase 1   Infrastructure   None
Phase 2   VertexAI PoC     Phase 1
Phase 3   All Providers    Phase 2
Phase 4   Agent API        Phase 3

Design Decisions

The following questions were resolved during specification:

  1. Token Counts in Streaming: Token counts are deltas, not running totals. Consumers can sum them if needed. This matches how most providers report usage and simplifies the implementation.

  2. Error Handling in Streams: If an error occurs, the error field is populated and no other fields are needed. An error is always the final communication - no subsequent messages are permitted or expected after an error. For LLM/Prompt streams, end_of_stream=true. For Agent streams, chunk_type="error" with end_of_dialog=true.

  3. Partial Response Recovery: The messaging protocol (Pulsar) is resilient, so message-level retry is not needed. If a client loses track of the stream or disconnects, it must retry the full request from scratch.

  4. Prompt Service Streaming: Streaming is only supported for plain-text responses (the text field), not structured responses (the object field). The prompt service knows at the outset, from the prompt template, whether the output will be JSON or text. If a streaming request is made for a JSON-output prompt, the service should either:

    • Return the complete JSON in a single response with end_of_stream=true, or
    • Reject the streaming request with an error

Open Questions

None at this time.

References

  • Current LLM schema: trustgraph-base/trustgraph/schema/services/llm.py
  • Current prompt schema: trustgraph-base/trustgraph/schema/services/prompt.py
  • Current agent schema: trustgraph-base/trustgraph/schema/services/agent.py
  • LLM service base: trustgraph-base/trustgraph/base/llm_service.py
  • VertexAI provider: trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py
  • Gateway API: trustgraph-base/trustgraph/api/
  • CLI tools: trustgraph-cli/trustgraph/cli/