Merge branch 'release/v1.6'

2026-05-18 11:55:12 +02:00 · 2025-12-03 09:50:56 +00:00 · 2025-12-03 09:50:56 +00:00 · 72bd677086
commit 72bd677086
parent a9dacda357 c808d26b0b
73 changed files with 6419 additions and 1109 deletions
--- a/.github/workflows/pull-request.yaml
+++ b/.github/workflows/pull-request.yaml
@ -22,7 +22,7 @@ jobs:
        uses: actions/checkout@v3
      - name: Setup packages
-        run: make update-package-versions VERSION=1.5.999
+        run: make update-package-versions VERSION=1.6.999
      - name: Setup environment
        run: python3 -m venv env
--- a/docs/tech-specs/ARCHITECTURE_PRINCIPLES.md
+++ b/docs/tech-specs/ARCHITECTURE_PRINCIPLES.md
@ -1,106 +0,0 @@
 # Knowledge Graph Architecture Foundations
 ## Foundation 1: Subject-Predicate-Object (SPO) Graph Model
 **Decision**: Adopt SPO/RDF as the core knowledge representation model
 **Rationale**:
 - Provides maximum flexibility and interoperability with existing graph technologies
 - Enables seamless translation to other graph query languages (e.g., SPO → Cypher, but not vice versa)
 - Creates a foundation that "unlocks a lot" of downstream capabilities
 - Supports both node-to-node relationships (SPO) and node-to-literal relationships (RDF)
 **Implementation**: 
 - Core data structure: `node → edge → {node | literal}`
 - Maintain compatibility with RDF standards while supporting extended SPO operations
 ## Foundation 2: LLM-Native Knowledge Graph Integration
 **Decision**: Optimize knowledge graph structure and operations for LLM interaction
 **Rationale**:
 - Primary use case involves LLMs interfacing with knowledge graphs
 - Graph technology choices must prioritize LLM compatibility over other considerations
 - Enables natural language processing workflows that leverage structured knowledge
 **Implementation**:
 - Design graph schemas that LLMs can effectively reason about
 - Optimize for common LLM interaction patterns
 ## Foundation 3: Embedding-Based Graph Navigation
 **Decision**: Implement direct mapping from natural language queries to graph nodes via embeddings
 **Rationale**:
 - Enables the simplest possible path from NLP query to graph navigation
 - Avoids complex intermediate query generation steps
 - Provides efficient semantic search capabilities within the graph structure
 **Implementation**:
 - `NLP Query → Graph Embeddings → Graph Nodes`
 - Maintain embedding representations for all graph entities
 - Support direct semantic similarity matching for query resolution
 ## Foundation 4: Distributed Entity Resolution with Deterministic Identifiers
 **Decision**: Support parallel knowledge extraction with deterministic entity identification (80% rule)
 **Rationale**:
 - **Ideal**: Single-process extraction with complete state visibility enables perfect entity resolution
 - **Reality**: Scalability requirements demand parallel processing capabilities
 - **Compromise**: Design for deterministic entity identification across distributed processes
 **Implementation**:
 - Develop mechanisms for generating consistent, unique identifiers across different knowledge extractors
 - Same entity mentioned in different processes must resolve to the same identifier
 - Acknowledge that ~20% of edge cases may require alternative processing models
 - Design fallback mechanisms for complex entity resolution scenarios
 ## Foundation 5: Event-Driven Architecture with Publish-Subscribe
 **Decision**: Implement pub-sub messaging system for system coordination
 **Rationale**:
 - Enables loose coupling between knowledge extraction, storage, and query components
 - Supports real-time updates and notifications across the system
 - Facilitates scalable, distributed processing workflows
 **Implementation**:
 - Message-driven coordination between system components
 - Event streams for knowledge updates, extraction completion, and query results
 ## Foundation 6: Reentrant Agent Communication
 **Decision**: Support reentrant pub-sub operations for agent-based processing
 **Rationale**:
 - Enables sophisticated agent workflows where agents can trigger and respond to each other
 - Supports complex, multi-step knowledge processing pipelines
 - Allows for recursive and iterative processing patterns
 **Implementation**:
 - Pub-sub system must handle reentrant calls safely
 - Agent coordination mechanisms that prevent infinite loops
 - Support for agent workflow orchestration
 ## Foundation 7: Columnar Data Store Integration
 **Decision**: Ensure query compatibility with columnar storage systems
 **Rationale**:
 - Enables efficient analytical queries over large knowledge datasets
 - Supports business intelligence and reporting use cases
 - Bridges graph-based knowledge representation with traditional analytical workflows
 **Implementation**:
 - Query translation layer: Graph queries → Columnar queries
 - Hybrid storage strategy supporting both graph operations and analytical workloads
 - Maintain query performance across both paradigms
 ---
 ## Architecture Principles Summary
 1. **Flexibility First**: SPO/RDF model provides maximum adaptability
 2. **LLM Optimization**: All design decisions consider LLM interaction requirements
 3. **Semantic Efficiency**: Direct embedding-to-node mapping for optimal query performance
 4. **Pragmatic Scalability**: Balance perfect accuracy with practical distributed processing
 5. **Event-Driven Coordination**: Pub-sub enables loose coupling and scalability
 6. **Agent-Friendly**: Support complex, multi-agent processing workflows
 7. **Analytical Compatibility**: Bridge graph and columnar paradigms for comprehensive querying
 These foundations establish a knowledge graph architecture that balances theoretical rigor with practical scalability requirements, optimized for LLM integration and distributed processing.
--- a/docs/tech-specs/LOGGING_STRATEGY.md
+++ b/docs/tech-specs/LOGGING_STRATEGY.md
@ -1,169 +0,0 @@
 # TrustGraph Logging Strategy
 ## Overview
 TrustGraph uses Python's built-in `logging` module for all logging operations. This provides a standardized, flexible approach to logging across all components of the system.
 ## Default Configuration
 ### Logging Level
 - **Default Level**: `INFO`
 - **Debug Mode**: `DEBUG` (enabled via command-line argument)
 - **Production**: `WARNING` or `ERROR` as appropriate
 ### Output Destination
 All logs should be written to **standard output (stdout)** to ensure compatibility with containerized environments and log aggregation systems.
 ## Implementation Guidelines
 ### 1. Logger Initialization
 Each module should create its own logger using the module's `__name__`:
 ```python
 import logging
 logger = logging.getLogger(__name__)
 ```
 ### 2. Centralized Configuration
 The logging configuration should be centralized in `async_processor.py` (or a dedicated logging configuration module) since it's inherited by much of the codebase:
 ```python
 import logging
 import argparse
 def setup_logging(log_level='INFO'):
    """Configure logging for the entire application"""
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[logging.StreamHandler()]
    )
 def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--log-level',
        default='INFO',
        choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
        help='Set the logging level (default: INFO)'
    )
    return parser.parse_args()
 # In main execution
 if __name__ == '__main__':
    args = parse_args()
    setup_logging(args.log_level)
 ```
 ### 3. Logging Best Practices
 #### Log Levels Usage
 - **DEBUG**: Detailed information for diagnosing problems (variable values, function entry/exit)
 - **INFO**: General informational messages (service started, configuration loaded, processing milestones)
 - **WARNING**: Warning messages for potentially harmful situations (deprecated features, recoverable errors)
 - **ERROR**: Error messages for serious problems (failed operations, exceptions)
 - **CRITICAL**: Critical messages for system failures requiring immediate attention
 #### Message Format
 ```python
 # Good - includes context
 logger.info(f"Processing document: {doc_id}, size: {doc_size} bytes")
 logger.error(f"Failed to connect to database: {error}", exc_info=True)
 # Avoid - lacks context
 logger.info("Processing document")
 logger.error("Connection failed")
 ```
 #### Performance Considerations
 ```python
 # Use lazy formatting for expensive operations
 logger.debug("Expensive operation result: %s", expensive_function())
 # Check log level for very expensive debug operations
 if logger.isEnabledFor(logging.DEBUG):
    debug_data = compute_expensive_debug_info()
    logger.debug(f"Debug data: {debug_data}")
 ```
 ### 4. Structured Logging
 For complex data, use structured logging:
 ```python
 logger.info("Request processed", extra={
    'request_id': request_id,
    'duration_ms': duration,
    'status_code': status_code,
    'user_id': user_id
 })
 ```
 ### 5. Exception Logging
 Always include stack traces for exceptions:
 ```python
 try:
    process_data()
 except Exception as e:
    logger.error(f"Failed to process data: {e}", exc_info=True)
    raise
 ```
 ### 6. Async Logging Considerations
 For async code, ensure thread-safe logging:
 ```python
 import asyncio
 import logging
 async def async_operation():
    logger = logging.getLogger(__name__)
    logger.info(f"Starting async operation in task: {asyncio.current_task().get_name()}")
 ```
 ## Environment Variables
 Support environment-based configuration as a fallback:
 ```python
 import os
 log_level = os.environ.get('TRUSTGRAPH_LOG_LEVEL', 'INFO')
 ```
 ## Testing
 During tests, consider using a different logging configuration:
 ```python
 # In test setup
 logging.getLogger().setLevel(logging.WARNING)  # Reduce noise during tests
 ```
 ## Monitoring Integration
 Ensure log format is compatible with monitoring tools:
 - Include timestamps in ISO format
 - Use consistent field names
 - Include correlation IDs where applicable
 - Structure logs for easy parsing (JSON format for production)
 ## Security Considerations
 - Never log sensitive information (passwords, API keys, personal data)
 - Sanitize user input before logging
 - Use placeholders for sensitive fields: `user_id=****1234`
 ## Migration Path
 For existing code using print statements:
 1. Replace `print()` with appropriate logger calls
 2. Choose appropriate log levels based on message importance
 3. Add context to make logs more useful
 4. Test logging output at different levels
--- a/docs/tech-specs/SCHEMA_REFACTORING_PROPOSAL.md
+++ b/docs/tech-specs/SCHEMA_REFACTORING_PROPOSAL.md
@ -1,91 +0,0 @@
 # Schema Directory Refactoring Proposal
 ## Current Issues
 1. **Flat structure** - All schemas in one directory makes it hard to understand relationships
 2. **Mixed concerns** - Core types, domain objects, and API contracts all mixed together
 3. **Unclear naming** - Files like "object.py", "types.py", "topic.py" don't clearly indicate their purpose
 4. **No clear layering** - Can't easily see what depends on what
 ## Proposed Structure
 ```
 trustgraph-base/trustgraph/schema/
 ├── __init__.py
 ├── core/              # Core primitive types used everywhere
 │   ├── __init__.py
 │   ├── primitives.py  # Error, Value, Triple, Field, RowSchema
 │   ├── metadata.py    # Metadata record
 │   └── topic.py       # Topic utilities
 │
 ├── knowledge/         # Knowledge domain models and extraction
 │   ├── __init__.py
 │   ├── graph.py       # EntityContext, EntityEmbeddings, Triples
 │   ├── document.py    # Document, TextDocument, Chunk
 │   ├── knowledge.py   # Knowledge extraction types
 │   ├── embeddings.py  # All embedding-related types (moved from multiple files)
 │   └── nlp.py         # Definition, Topic, Relationship, Fact types
 │
 └── services/          # Service request/response contracts
    ├── __init__.py
    ├── llm.py         # TextCompletion, Embeddings, Tool requests/responses
    ├── retrieval.py   # GraphRAG, DocumentRAG queries/responses
    ├── query.py       # GraphEmbeddingsRequest/Response, DocumentEmbeddingsRequest/Response
    ├── agent.py       # Agent requests/responses
    ├── flow.py        # Flow requests/responses
    ├── prompt.py      # Prompt service requests/responses
    ├── config.py      # Configuration service
    ├── library.py     # Librarian service
    └── lookup.py      # Lookup service
 ```
 ## Key Changes
 1. **Hierarchical organization** - Clear separation between core types, knowledge models, and service contracts
 2. **Better naming**:
   - `types.py` → `core/primitives.py` (clearer purpose)
   - `object.py` → Split between appropriate files based on actual content
   - `documents.py` → `knowledge/document.py` (singular, consistent)
   - `models.py` → `services/llm.py` (clearer what kind of models)
   - `prompt.py` → Split: service parts to `services/prompt.py`, data types to `knowledge/nlp.py`
 3. **Logical grouping**:
   - All embedding types consolidated in `knowledge/embeddings.py`
   - All LLM-related service contracts in `services/llm.py`
   - Clear separation of request/response pairs in services directory
   - Knowledge extraction types grouped with other knowledge domain models
 4. **Dependency clarity**:
   - Core types have no dependencies
   - Knowledge models depend only on core
   - Service contracts can depend on both core and knowledge models
 ## Migration Benefits
 1. **Easier navigation** - Developers can quickly find what they need
 2. **Better modularity** - Clear boundaries between different concerns
 3. **Simpler imports** - More intuitive import paths
 4. **Future-proof** - Easy to add new knowledge types or services without cluttering
 ## Example Import Changes
 ```python
 # Before
 from trustgraph.schema import Error, Triple, GraphEmbeddings, TextCompletionRequest
 # After
 from trustgraph.schema.core import Error, Triple
 from trustgraph.schema.knowledge import GraphEmbeddings
 from trustgraph.schema.services import TextCompletionRequest
 ```
 ## Implementation Notes
 1. Keep backward compatibility by maintaining imports in root `__init__.py`
 2. Move files gradually, updating imports as needed
 3. Consider adding a `legacy.py` that imports everything for transition period
 4. Update documentation to reflect new structure
 <function_calls>
 <invoke name="TodoWrite">
 <parameter name="todos">[{"id": "1", "content": "Examine current schema directory structure", "status": "completed", "priority": "high"}, {"id": "2", "content": "Analyze schema files and their purposes", "status": "completed", "priority": "high"}, {"id": "3", "content": "Propose improved naming and structure", "status": "completed", "priority": "high"}]
--- a/docs/tech-specs/STRUCTURED_DATA.md
+++ b/docs/tech-specs/STRUCTURED_DATA.md
@ -1,253 +0,0 @@
 # Structured Data Technical Specification
 ## Overview
 This specification describes the integration of TrustGraph with structured data flows, enabling the system to work with data that can be represented as rows in tables or objects in object stores. The integration supports four primary use cases:
 1. **Unstructured to Structured Extraction**: Read unstructured data sources, identify and extract object structures, and store them in a tabular format
 2. **Structured Data Ingestion**: Load data that is already in structured formats directly into the structured store alongside extracted data
 3. **Natural Language Querying**: Convert natural language questions into structured queries to extract matching data from the store
 4. **Direct Structured Querying**: Execute structured queries directly against the data store for precise data retrieval
 ## Goals
 - **Unified Data Access**: Provide a single interface for accessing both structured and unstructured data within TrustGraph
 - **Seamless Integration**: Enable smooth interoperability between TrustGraph's graph-based knowledge representation and traditional structured data formats
 - **Flexible Extraction**: Support automatic extraction of structured data from various unstructured sources (documents, text, etc.)
 - **Query Versatility**: Allow users to query data using both natural language and structured query languages
 - **Data Consistency**: Maintain data integrity and consistency across different data representations
 - **Performance Optimization**: Ensure efficient storage and retrieval of structured data at scale
 - **Schema Flexibility**: Support both schema-on-write and schema-on-read approaches to accommodate diverse data sources
 - **Backwards Compatibility**: Preserve existing TrustGraph functionality while adding structured data capabilities
 ## Background
 TrustGraph currently excels at processing unstructured data and building knowledge graphs from diverse sources. However, many enterprise use cases involve data that is inherently structured - customer records, transaction logs, inventory databases, and other tabular datasets. These structured datasets often need to be analyzed alongside unstructured content to provide comprehensive insights.
 Current limitations include:
 - No native support for ingesting pre-structured data formats (CSV, JSON arrays, database exports)
 - Inability to preserve the inherent structure when extracting tabular data from documents
 - Lack of efficient querying mechanisms for structured data patterns
 - Missing bridge between SQL-like queries and TrustGraph's graph queries
 This specification addresses these gaps by introducing a structured data layer that complements TrustGraph's existing capabilities. By supporting structured data natively, TrustGraph can:
 - Serve as a unified platform for both structured and unstructured data analysis
 - Enable hybrid queries that span both graph relationships and tabular data
 - Provide familiar interfaces for users accustomed to working with structured data
 - Unlock new use cases in data integration and business intelligence
 ## Technical Design
 ### Architecture
 The structured data integration requires the following technical components:
 1. **NLP-to-Structured-Query Service**
   - Converts natural language questions into structured queries
   - Supports multiple query language targets (initially SQL-like syntax)
   - Integrates with existing TrustGraph NLP capabilities
   Module: trustgraph-flow/trustgraph/query/nlp_query/cassandra
 2. **Configuration Schema Support** ✅ **[COMPLETE]**
   - Extended configuration system to store structured data schemas
   - Support for defining table structures, field types, and relationships
   - Schema versioning and migration capabilities
 3. **Object Extraction Module** ✅ **[COMPLETE]**
   - Enhanced knowledge extractor flow integration
   - Identifies and extracts structured objects from unstructured sources
   - Maintains provenance and confidence scores
   - Registers a config handler (example: trustgraph-flow/trustgraph/prompt/template/service.py) to receive config data and decode schema information
   - Receives objects and decodes them to ExtractedObject objects for delivery on the Pulsar queue
   - NOTE: There's existing code at `trustgraph-flow/trustgraph/extract/object/row/`. This was a previous attempt and will need to be majorly refactored as it doesn't conform to current APIs. Use it if it's useful, start from scratch if not.
   - Requires a command-line interface: `kg-extract-objects`
   Module: trustgraph-flow/trustgraph/extract/kg/objects/
 4. **Structured Store Writer Module** ✅ **[COMPLETE]**
   - Receives objects in ExtractedObject format from Pulsar queues
   - Initial implementation targeting Apache Cassandra as the structured data store
   - Handles dynamic table creation based on schemas encountered
   - Manages schema-to-Cassandra table mapping and data transformation
   - Provides batch and streaming write operations for performance optimization
   - No Pulsar outputs - this is a terminal service in the data flow
   **Schema Handling**:
   - Monitors incoming ExtractedObject messages for schema references
   - When a new schema is encountered for the first time, automatically creates the corresponding Cassandra table
   - Maintains a cache of known schemas to avoid redundant table creation attempts
   - Should consider whether to receive schema definitions directly or rely on schema names in ExtractedObject messages
   **Cassandra Table Mapping**:
   - Keyspace is named after the `user` field from ExtractedObject's Metadata
   - Table is named after the `schema_name` field from ExtractedObject
   - Collection from Metadata becomes part of the partition key to ensure:
     - Natural data distribution across Cassandra nodes
     - Efficient queries within a specific collection
     - Logical isolation between different data imports/sources
   - Primary key structure: `PRIMARY KEY ((collection, <schema_primary_key_fields>), <clustering_keys>)`
     - Collection is always the first component of the partition key
     - Schema-defined primary key fields follow as part of the composite partition key
     - This requires queries to specify the collection, ensuring predictable performance
   - Field definitions map to Cassandra columns with type conversions:
     - `string` → `text`
     - `integer` → `int` or `bigint` based on size hint
     - `float` → `float` or `double` based on precision needs
     - `boolean` → `boolean`
     - `timestamp` → `timestamp`
     - `enum` → `text` with application-level validation
   - Indexed fields create Cassandra secondary indexes (excluding fields already in the primary key)
   - Required fields are enforced at the application level (Cassandra doesn't support NOT NULL)
   **Object Storage**:
   - Extracts values from ExtractedObject.values map
   - Performs type conversion and validation before insertion
   - Handles missing optional fields gracefully
   - Maintains metadata about object provenance (source document, confidence scores)
   - Supports idempotent writes to handle message replay scenarios
   **Implementation Notes**:
   - Existing code at `trustgraph-flow/trustgraph/storage/objects/cassandra/` is outdated and doesn't comply with current APIs
   - Should reference `trustgraph-flow/trustgraph/storage/triples/cassandra` as an example of a working storage processor
   - Needs evaluation of existing code for any reusable components before deciding to refactor or rewrite
   Module: trustgraph-flow/trustgraph/storage/objects/cassandra
 5. **Structured Query Service**
   - Accepts structured queries in defined formats
   - Executes queries against the structured store
   - Returns objects matching query criteria
   - Supports pagination and result filtering
   Module: trustgraph-flow/trustgraph/query/objects/cassandra
 6. **Agent Tool Integration**
   - New tool class for agent frameworks
   - Enables agents to query structured data stores
   - Provides natural language and structured query interfaces
   - Integrates with existing agent decision-making processes
 7. **Structured Data Ingestion Service**
   - Accepts structured data in multiple formats (JSON, CSV, XML)
   - Parses and validates incoming data against defined schemas
   - Converts data into normalized object streams
   - Emits objects to appropriate message queues for processing
   - Supports bulk uploads and streaming ingestion
   Module: trustgraph-flow/trustgraph/decoding/structured
 8. **Object Embedding Service**
   - Generates vector embeddings for structured objects
   - Enables semantic search across structured data
   - Supports hybrid search combining structured queries with semantic similarity
   - Integrates with existing vector stores
   Module: trustgraph-flow/trustgraph/embeddings/object_embeddings/qdrant
 ### Data Models
 #### Schema Storage Mechanism
 Schemas are stored in TrustGraph's configuration system using the following structure:
 - **Type**: `schema` (fixed value for all structured data schemas)
 - **Key**: The unique name/identifier of the schema (e.g., `customer_records`, `transaction_log`)
 - **Value**: JSON schema definition containing the structure
 Example configuration entry:
 ```
 Type: schema
 Key: customer_records
 Value: {
  "name": "customer_records",
  "description": "Customer information table",
  "fields": [
    {
      "name": "customer_id",
      "type": "string",
      "primary_key": true
    },
    {
      "name": "name",
      "type": "string",
      "required": true
    },
    {
      "name": "email",
      "type": "string",
      "required": true
    },
    {
      "name": "registration_date",
      "type": "timestamp"
    },
    {
      "name": "status",
      "type": "string",
      "enum": ["active", "inactive", "suspended"]
    }
  ],
  "indexes": ["email", "registration_date"]
 }
 ```
 This approach allows:
 - Dynamic schema definition without code changes
 - Easy schema updates and versioning
 - Consistent integration with existing TrustGraph configuration management
 - Support for multiple schemas within a single deployment
 ### APIs
 New APIs:
  - Pulsar schemas for above types
  - Pulsar interfaces in new flows
  - Need a means to specify schema types in flows so that flows know which
    schema types to load
  - APIs added to gateway and rev-gateway
 Modified APIs:
 - Knowledge extraction endpoints - Add structured object output option
 - Agent endpoints - Add structured data tool support
 ### Implementation Details
 Following existing conventions - these are just new processing modules.
 Everything is in the trustgraph-flow packages except for schema items
 in trustgraph-base.
 Need some UI work in the Workbench to be able to demo / pilot this
 capability.
 ## Security Considerations
 No extra considerations.
 ## Performance Considerations
 Some questions around using Cassandra queries and indexes so that queries
 don't slow down.
 ## Testing Strategy
 Use existing test strategy, will build unit, contract and integration tests.
 ## Migration Plan
 None.
 ## Timeline
 Not specified.
 ## Open Questions
 - Can this be made to work with other store types?  We're aiming to use
  interfaces which make modules which work with one store applicable to
  other stores.
 ## References
 n/a.
--- a/docs/tech-specs/STRUCTURED_DATA_SCHEMAS.md
+++ b/docs/tech-specs/STRUCTURED_DATA_SCHEMAS.md
@ -1,139 +0,0 @@
 # Structured Data Pulsar Schema Changes
 ## Overview
 Based on the STRUCTURED_DATA.md specification, this document proposes the necessary Pulsar schema additions and modifications to support structured data capabilities in TrustGraph.
 ## Required Schema Changes
 ### 1. Core Schema Enhancements
 #### Enhanced Field Definition
 The existing `Field` class in `core/primitives.py` needs additional properties:
 ```python
 class Field(Record):
    name = String()
    type = String()  # int, string, long, bool, float, double, timestamp
    size = Integer()
    primary = Boolean()
    description = String()
    # NEW FIELDS:
    required = Boolean()  # Whether field is required
    enum_values = Array(String())  # For enum type fields
    indexed = Boolean()  # Whether field should be indexed
 ```
 ### 2. New Knowledge Schemas
 #### 2.1 Structured Data Submission
 New file: `knowledge/structured.py`
 ```python
 from pulsar.schema import Record, String, Bytes, Map
 from ..core.metadata import Metadata
 class StructuredDataSubmission(Record):
    metadata = Metadata()
    format = String()  # "json", "csv", "xml"
    schema_name = String()  # Reference to schema in config
    data = Bytes()  # Raw data to ingest
    options = Map(String())  # Format-specific options
 ```
 ### 3. New Service Schemas
 #### 3.1 NLP to Structured Query Service
 New file: `services/nlp_query.py`
 ```python
 from pulsar.schema import Record, String, Array, Map, Integer, Double
 from ..core.primitives import Error
 class NLPToStructuredQueryRequest(Record):
    natural_language_query = String()
    max_results = Integer()
    context_hints = Map(String())  # Optional context for query generation
 class NLPToStructuredQueryResponse(Record):
    error = Error()
    graphql_query = String()  # Generated GraphQL query
    variables = Map(String())  # GraphQL variables if any
    detected_schemas = Array(String())  # Which schemas the query targets
    confidence = Double()
 ```
 #### 3.2 Structured Query Service
 New file: `services/structured_query.py`
 ```python
 from pulsar.schema import Record, String, Map, Array
 from ..core.primitives import Error
 class StructuredQueryRequest(Record):
    query = String()  # GraphQL query
    variables = Map(String())  # GraphQL variables
    operation_name = String()  # Optional operation name for multi-operation documents
 class StructuredQueryResponse(Record):
    error = Error()
    data = String()  # JSON-encoded GraphQL response data
    errors = Array(String())  # GraphQL errors if any
 ```
 #### 2.2 Object Extraction Output
 New file: `knowledge/object.py`
 ```python
 from pulsar.schema import Record, String, Map, Double
 from ..core.metadata import Metadata
 class ExtractedObject(Record):
    metadata = Metadata()
    schema_name = String()  # Which schema this object belongs to
    values = Map(String())  # Field name -> value
    confidence = Double()
    source_span = String()  # Text span where object was found
 ```
 ### 4. Enhanced Knowledge Schemas
 #### 4.1 Object Embeddings Enhancement
 Update `knowledge/embeddings.py` to support structured object embeddings better:
 ```python
 class StructuredObjectEmbedding(Record):
    metadata = Metadata()
    vectors = Array(Array(Double()))
    schema_name = String()
    object_id = String()  # Primary key value
    field_embeddings = Map(Array(Double()))  # Per-field embeddings
 ```
 ## Integration Points
 ### Flow Integration
 The schemas will be used by new flow modules:
 - `trustgraph-flow/trustgraph/decoding/structured` - Uses StructuredDataSubmission
 - `trustgraph-flow/trustgraph/query/nlp_query/cassandra` - Uses NLP query schemas
 - `trustgraph-flow/trustgraph/query/objects/cassandra` - Uses structured query schemas
 - `trustgraph-flow/trustgraph/extract/object/row/` - Consumes Chunk, produces ExtractedObject
 - `trustgraph-flow/trustgraph/storage/objects/cassandra` - Uses Rows schema
 - `trustgraph-flow/trustgraph/embeddings/object_embeddings/qdrant` - Uses object embedding schemas
 ## Implementation Notes
 1. **Schema Versioning**: Consider adding a `version` field to RowSchema for future migration support
 2. **Type System**: The `Field.type` should support all Cassandra native types
 3. **Batch Operations**: Most services should support both single and batch operations
 4. **Error Handling**: Consistent error reporting across all new services
 5. **Backwards Compatibility**: Existing schemas remain unchanged except for minor Field enhancements
 ## Next Steps
 1. Implement schema files in the new structure
 2. Update existing services to recognize new schema types
 3. Implement flow modules that use these schemas
 4. Add gateway/rev-gateway endpoints for new services
 5. Create unit tests for schema validation
--- a/docs/tech-specs/rag-streaming-support.md
+++ b/docs/tech-specs/rag-streaming-support.md
@ -0,0 +1,288 @@
 # RAG Streaming Support Technical Specification
 ## Overview
 This specification describes adding streaming support to GraphRAG and DocumentRAG services, enabling real-time token-by-token responses for knowledge graph and document retrieval queries. This extends the existing streaming architecture already implemented for LLM text-completion, prompt, and agent services.
 ## Goals
 - **Consistent streaming UX**: Provide the same streaming experience across all TrustGraph services
 - **Minimal API changes**: Add streaming support with a single `streaming` flag, following established patterns
 - **Backward compatibility**: Maintain existing non-streaming behavior as default
 - **Reuse existing infrastructure**: Leverage PromptClient streaming already implemented
 - **Gateway support**: Enable streaming through websocket gateway for client applications
 ## Background
 Currently implemented streaming services:
 - **LLM text-completion service**: Phase 1 - streaming from LLM providers
 - **Prompt service**: Phase 2 - streaming through prompt templates
 - **Agent service**: Phase 3-4 - streaming ReAct responses with incremental thought/observation/answer chunks
 Current limitations for RAG services:
 - GraphRAG and DocumentRAG only support blocking responses
 - Users must wait for complete LLM response before seeing any output
 - Poor UX for long responses from knowledge graph or document queries
 - Inconsistent experience compared to other TrustGraph services
 This specification addresses these gaps by adding streaming support to GraphRAG and DocumentRAG. By enabling token-by-token responses, TrustGraph can:
 - Provide consistent streaming UX across all query types
 - Reduce perceived latency for RAG queries
 - Enable better progress feedback for long-running queries
 - Support real-time display in client applications
 ## Technical Design
 ### Architecture
 The RAG streaming implementation leverages existing infrastructure:
 1. **PromptClient Streaming** (Already implemented)
   - `kg_prompt()` and `document_prompt()` already accept `streaming` and `chunk_callback` parameters
   - These call `prompt()` internally with streaming support
   - No changes needed to PromptClient
   Module: `trustgraph-base/trustgraph/base/prompt_client.py`
 2. **GraphRAG Service** (Needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.kg_prompt()`
   - GraphRagRequest schema needs `streaming` field
   Modules:
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/graph_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/graph_rag.py` (Gateway)
 3. **DocumentRAG Service** (Needs streaming parameter pass-through)
   - Add `streaming` parameter to `query()` method
   - Pass streaming flag and callbacks to `prompt_client.document_prompt()`
   - DocumentRagRequest schema needs `streaming` field
   Modules:
   - `trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py`
   - `trustgraph-flow/trustgraph/retrieval/document_rag/rag.py` (Processor)
   - `trustgraph-base/trustgraph/schema/document_rag.py` (Request schema)
   - `trustgraph-flow/trustgraph/gateway/dispatch/document_rag.py` (Gateway)
 ### Data Flow
 **Non-streaming (current)**:
 ```
 Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=False)
                                   ↓
                                Prompt Service → LLM
                                   ↓
                                Complete response
                                   ↓
 Client ← Gateway ← RAG Service ←  Response
 ```
 **Streaming (proposed)**:
 ```
 Client → Gateway → RAG Service → PromptClient.kg_prompt(streaming=True, chunk_callback=cb)
                                   ↓
                                Prompt Service → LLM (streaming)
                                   ↓
                                Chunk → callback → RAG Response (chunk)
                                   ↓                       ↓
 Client ← Gateway ← ────────────────────────────────── Response stream
 ```
 ### APIs
 **GraphRAG Changes**:
 1. **GraphRag.query()** - Add streaming parameters
 ```python
 async def query(
    self, query, user, collection,
    verbose=False, streaming=False, chunk_callback=None  # NEW
 ):
    # ... existing entity/triple retrieval ...
    if streaming and chunk_callback:
        resp = await self.prompt_client.kg_prompt(
            query, kg,
            streaming=True,
            chunk_callback=chunk_callback
        )
    else:
        resp = await self.prompt_client.kg_prompt(query, kg)
    return resp
 ```
 2. **GraphRagRequest schema** - Add streaming field
 ```python
 class GraphRagRequest(Record):
    query = String()
    user = String()
    collection = String()
    streaming = Boolean()  # NEW
 ```
 3. **GraphRagResponse schema** - Add streaming fields (follow Agent pattern)
 ```python
 class GraphRagResponse(Record):
    response = String()    # Legacy: complete response
    chunk = String()       # NEW: streaming chunk
    end_of_stream = Boolean()  # NEW: indicates last chunk
 ```
 4. **Processor** - Pass streaming through
 ```python
 async def handle(self, msg):
    # ... existing code ...
    async def send_chunk(chunk):
        await self.respond(GraphRagResponse(
            chunk=chunk,
            end_of_stream=False,
            response=None
        ))
    if request.streaming:
        full_response = await self.rag.query(
            query=request.query,
            user=request.user,
            collection=request.collection,
            streaming=True,
            chunk_callback=send_chunk
        )
        # Send final message
        await self.respond(GraphRagResponse(
            chunk=None,
            end_of_stream=True,
            response=full_response
        ))
    else:
        # Existing non-streaming path
        response = await self.rag.query(...)
        await self.respond(GraphRagResponse(response=response))
 ```
 **DocumentRAG Changes**:
 Identical pattern to GraphRAG:
 1. Add `streaming` and `chunk_callback` parameters to `DocumentRag.query()`
 2. Add `streaming` field to `DocumentRagRequest`
 3. Add `chunk` and `end_of_stream` fields to `DocumentRagResponse`
 4. Update Processor to handle streaming with callbacks
 **Gateway Changes**:
 Both `graph_rag.py` and `document_rag.py` in gateway/dispatch need updates to forward streaming chunks to websocket:
 ```python
 async def handle(self, message, session, websocket):
    # ... existing code ...
    if request.streaming:
        async def recipient(resp):
            if resp.chunk:
                await websocket.send(json.dumps({
                    "id": message["id"],
                    "response": {"chunk": resp.chunk},
                    "complete": resp.end_of_stream
                }))
            return resp.end_of_stream
        await self.rag_client.request(request, recipient=recipient)
    else:
        # Existing non-streaming path
        resp = await self.rag_client.request(request)
        await websocket.send(...)
 ```
 ### Implementation Details
 **Implementation order**:
 1. Add schema fields (Request + Response for both RAG services)
 2. Update GraphRag.query() and DocumentRag.query() methods
 3. Update Processors to handle streaming
 4. Update Gateway dispatch handlers
 5. Add `--no-streaming` flags to `tg-invoke-graph-rag` and `tg-invoke-document-rag` (streaming enabled by default, following agent CLI pattern)
 **Callback pattern**:
 Follow the same async callback pattern established in Agent streaming:
 - Processor defines `async def send_chunk(chunk)` callback
 - Passes callback to RAG service
 - RAG service passes callback to PromptClient
 - PromptClient invokes callback for each LLM chunk
 - Processor sends streaming response message for each chunk
 **Error handling**:
 - Errors during streaming should send error response with `end_of_stream=True`
 - Follow existing error propagation patterns from Agent streaming
 ## Security Considerations
 No new security considerations beyond existing RAG services:
 - Streaming responses use same user/collection isolation
 - No changes to authentication or authorization
 - Chunk boundaries don't expose sensitive data
 ## Performance Considerations
 **Benefits**:
 - Reduced perceived latency (first tokens arrive faster)
 - Better UX for long responses
 - Lower memory usage (no need to buffer complete response)
 **Potential concerns**:
 - More Pulsar messages for streaming responses
 - Slightly higher CPU for chunking/callback overhead
 - Mitigated by: streaming is opt-in, default remains non-streaming
 **Testing considerations**:
 - Test with large knowledge graphs (many triples)
 - Test with many retrieved documents
 - Measure overhead of streaming vs non-streaming
 ## Testing Strategy
 **Unit tests**:
 - Test GraphRag.query() with streaming=True/False
 - Test DocumentRag.query() with streaming=True/False
 - Mock PromptClient to verify callback invocations
 **Integration tests**:
 - Test full GraphRAG streaming flow (similar to existing agent streaming tests)
 - Test full DocumentRAG streaming flow
 - Test Gateway streaming forwarding
 - Test CLI streaming output
 **Manual testing**:
 - `tg-invoke-graph-rag -q "What is machine learning?"` (streaming by default)
 - `tg-invoke-document-rag -q "Summarize the documents about AI"` (streaming by default)
 - `tg-invoke-graph-rag --no-streaming -q "..."` (test non-streaming mode)
 - Verify incremental output appears in streaming mode
 ## Migration Plan
 No migration needed:
 - Streaming is opt-in via `streaming` parameter (defaults to False)
 - Existing clients continue to work unchanged
 - New clients can opt into streaming
 ## Timeline
 Estimated implementation: 4-6 hours
 - Phase 1 (2 hours): GraphRAG streaming support
 - Phase 2 (2 hours): DocumentRAG streaming support
 - Phase 3 (1-2 hours): Gateway updates and CLI flags
 - Testing: Built into each phase
 ## Open Questions
 - Should we add streaming support to NLP Query service as well?
 - Do we want to stream intermediate steps (e.g., "Retrieving entities...", "Querying graph...") or just LLM output?
 - Should GraphRAG/DocumentRAG responses include chunk metadata (e.g., chunk number, total expected)?
 ## References
 - Existing implementation: `docs/tech-specs/streaming-llm-responses.md`
 - Agent streaming: `trustgraph-flow/trustgraph/agent/react/agent_manager.py`
 - PromptClient streaming: `trustgraph-base/trustgraph/base/prompt_client.py`
--- a/docs/tech-specs/streaming-llm-responses.md
+++ b/docs/tech-specs/streaming-llm-responses.md
@ -0,0 +1,570 @@
 # Streaming LLM Responses Technical Specification
 ## Overview
 This specification describes the implementation of streaming support for LLM
 responses in TrustGraph. Streaming enables real-time delivery of generated
 tokens as they are produced by the LLM, rather than waiting for complete
 response generation.
 This implementation supports the following use cases:
 1. **Real-time User Interfaces**: Stream tokens to UI as they are generated,
   providing immediate visual feedback
 2. **Reduced Time-to-First-Token**: Users see output beginning immediately
   rather than waiting for full generation
 3. **Long Response Handling**: Handle very long outputs that might otherwise
   timeout or exceed memory limits
 4. **Interactive Applications**: Enable responsive chat and agent interfaces
 ## Goals
 - **Backward Compatibility**: Existing non-streaming clients continue to work
  without modification
 - **Consistent API Design**: Streaming and non-streaming use the same schema
  patterns with minimal divergence
 - **Provider Flexibility**: Support streaming where available, graceful
  fallback where not
 - **Phased Rollout**: Incremental implementation to reduce risk
 - **End-to-End Support**: Streaming from LLM provider through to client
  applications via Pulsar, Gateway API, and Python API
 ## Background
 ### Current Architecture
 The current LLM text completion flow operates as follows:
 1. Client sends `TextCompletionRequest` with `system` and `prompt` fields
 2. LLM service processes the request and waits for complete generation
 3. Single `TextCompletionResponse` returned with complete `response` string
 Current schema (`trustgraph-base/trustgraph/schema/services/llm.py`):
 ```python
 class TextCompletionRequest(Record):
    system = String()
    prompt = String()
 class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
 ```
 ### Current Limitations
 - **Latency**: Users must wait for complete generation before seeing any output
 - **Timeout Risk**: Long generations may exceed client timeout thresholds
 - **Poor UX**: No feedback during generation creates perception of slowness
 - **Resource Usage**: Full responses must be buffered in memory
 This specification addresses these limitations by enabling incremental response
 delivery while maintaining full backward compatibility.
 ## Technical Design
 ### Phase 1: Infrastructure
 Phase 1 establishes the foundation for streaming by modifying schemas, APIs,
 and CLI tools.
 #### Schema Changes
 ##### LLM Schema (`trustgraph-base/trustgraph/schema/services/llm.py`)
 **Request Changes:**
 ```python
 class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # NEW: Default false for backward compatibility
 ```
 - `streaming`: When `true`, requests streaming response delivery
 - Default: `false` (existing behavior preserved)
 **Response Changes:**
 ```python
 class TextCompletionResponse(Record):
    error = Error()
    response = String()
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
 ```
 - `end_of_stream`: When `true`, indicates this is the final (or only) response
 - For non-streaming requests: Single response with `end_of_stream=true`
 - For streaming requests: Multiple responses, all with `end_of_stream=false`
  except the final one
 ##### Prompt Schema (`trustgraph-base/trustgraph/schema/services/prompt.py`)
 The prompt service wraps text completion, so it mirrors the same pattern:
 **Request Changes:**
 ```python
 class PromptRequest(Record):
    id = String()
    terms = Map(String())
    streaming = Boolean()  # NEW: Default false
 ```
 **Response Changes:**
 ```python
 class PromptResponse(Record):
    error = Error()
    text = String()
    object = String()
    end_of_stream = Boolean()  # NEW: Indicates final message
 ```
 #### Gateway API Changes
 The Gateway API must expose streaming capabilities to HTTP/WebSocket clients.
 **REST API Updates:**
 - `POST /api/v1/text-completion`: Accept `streaming` parameter in request body
 - Response behavior depends on streaming flag:
  - `streaming=false`: Single JSON response (current behavior)
  - `streaming=true`: Server-Sent Events (SSE) stream or WebSocket messages
 **Response Format (Streaming):**
 Each streamed chunk follows the same schema structure:
 ```json
 {
  "response": "partial text...",
  "end_of_stream": false,
  "model": "model-name"
 }
 ```
 Final chunk:
 ```json
 {
  "response": "final text chunk",
  "end_of_stream": true,
  "in_token": 150,
  "out_token": 500,
  "model": "model-name"
 }
 ```
 #### Python API Changes
 The Python client API must support both streaming and non-streaming modes
 while maintaining backward compatibility.
 **LlmClient Updates** (`trustgraph-base/trustgraph/clients/llm_client.py`):
 ```python
 class LlmClient(BaseClient):
    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        # Existing behavior when streaming=False
    async def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request.
        Yields response chunks as they arrive.
        """
        # New async generator method
 ```
 **PromptClient Updates** (`trustgraph-base/trustgraph/base/prompt_client.py`):
 Similar pattern with `streaming` parameter and async generator variant.
 #### CLI Tool Changes
 **tg-invoke-llm** (`trustgraph-cli/trustgraph/cli/invoke_llm.py`):
 ```
 tg-invoke-llm [system] [prompt] [--no-streaming] [-u URL] [-f flow-id]
 ```
 - Streaming enabled by default for better interactive UX
 - `--no-streaming` flag disables streaming
 - When streaming: Output tokens to stdout as they arrive
 - When not streaming: Wait for complete response, then output
 **tg-invoke-prompt** (`trustgraph-cli/trustgraph/cli/invoke_prompt.py`):
 ```
 tg-invoke-prompt [template-id] [var=value...] [--no-streaming] [-u URL] [-f flow-id]
 ```
 Same pattern as `tg-invoke-llm`.
 #### LLM Service Base Class Changes
 **LlmService** (`trustgraph-base/trustgraph/base/llm_service.py`):
 ```python
 class LlmService(FlowProcessor):
    async def on_request(self, msg, consumer, flow):
        request = msg.value()
        streaming = getattr(request, 'streaming', False)
        if streaming and self.supports_streaming():
            async for chunk in self.generate_content_stream(...):
                await self.send_response(chunk, end_of_stream=False)
            await self.send_response(final_chunk, end_of_stream=True)
        else:
            response = await self.generate_content(...)
            await self.send_response(response, end_of_stream=True)
    def supports_streaming(self):
        """Override in subclass to indicate streaming support."""
        return False
    async def generate_content_stream(self, system, prompt, model, temperature):
        """Override in subclass to implement streaming."""
        raise NotImplementedError()
 ```
 ---
 ### Phase 2: VertexAI Proof of Concept
 Phase 2 implements streaming in a single provider (VertexAI) to validate the
 infrastructure and enable end-to-end testing.
 #### VertexAI Implementation
 **Module:** `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
 **Changes:**
 1. Override `supports_streaming()` to return `True`
 2. Implement `generate_content_stream()` async generator
 3. Handle both Gemini and Claude models (via VertexAI Anthropic API)
 **Gemini Streaming:**
 ```python
 async def generate_content_stream(self, system, prompt, model, temperature):
    model_instance = self.get_model(model, temperature)
    response = model_instance.generate_content(
        [system, prompt],
        stream=True  # Enable streaming
    )
    for chunk in response:
        yield LlmChunk(
            text=chunk.text,
            in_token=None,  # Available only in final chunk
            out_token=None,
        )
    # Final chunk includes token counts from response.usage_metadata
 ```
 **Claude (via VertexAI Anthropic) Streaming:**
 ```python
 async def generate_content_stream(self, system, prompt, model, temperature):
    with self.anthropic_client.messages.stream(...) as stream:
        for text in stream.text_stream:
            yield LlmChunk(text=text)
    # Token counts from stream.get_final_message()
 ```
 #### Testing
 - Unit tests for streaming response assembly
 - Integration tests with VertexAI (Gemini and Claude)
 - End-to-end tests: CLI -> Gateway -> Pulsar -> VertexAI -> back
 - Backward compatibility tests: Non-streaming requests still work
 ---
 ### Phase 3: All LLM Providers
 Phase 3 extends streaming support to all LLM providers in the system.
 #### Provider Implementation Status
 Each provider must either:
 1. **Full Streaming Support**: Implement `generate_content_stream()`
 2. **Compatibility Mode**: Handle the `end_of_stream` flag correctly
   (return single response with `end_of_stream=true`)
 | Provider | Package | Streaming Support |
 |----------|---------|-------------------|
 | OpenAI | trustgraph-flow | Full (native streaming API) |
 | Claude/Anthropic | trustgraph-flow | Full (native streaming API) |
 | Ollama | trustgraph-flow | Full (native streaming API) |
 | Cohere | trustgraph-flow | Full (native streaming API) |
 | Mistral | trustgraph-flow | Full (native streaming API) |
 | Azure OpenAI | trustgraph-flow | Full (native streaming API) |
 | Google AI Studio | trustgraph-flow | Full (native streaming API) |
 | VertexAI | trustgraph-vertexai | Full (Phase 2) |
 | Bedrock | trustgraph-bedrock | Full (native streaming API) |
 | LM Studio | trustgraph-flow | Full (OpenAI-compatible) |
 | LlamaFile | trustgraph-flow | Full (OpenAI-compatible) |
 | vLLM | trustgraph-flow | Full (OpenAI-compatible) |
 | TGI | trustgraph-flow | TBD |
 | Azure | trustgraph-flow | TBD |
 #### Implementation Pattern
 For OpenAI-compatible providers (OpenAI, LM Studio, LlamaFile, vLLM):
 ```python
 async def generate_content_stream(self, system, prompt, model, temperature):
    response = await self.client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield LlmChunk(text=chunk.choices[0].delta.content)
 ```
 ---
 ### Phase 4: Agent API
 Phase 4 extends streaming to the Agent API. This is more complex because the
 Agent API is already multi-message by nature (thought → action → observation
 → repeat → final answer).
 #### Current Agent Schema
 ```python
 class AgentStep(Record):
    thought = String()
    action = String()
    arguments = Map(String())
    observation = String()
    user = String()
 class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
 class AgentResponse(Record):
    answer = String()
    error = Error()
    thought = String()
    observation = String()
 ```
 #### Proposed Agent Schema Changes
 **Request Changes:**
 ```python
 class AgentRequest(Record):
    question = String()
    state = String()
    group = Array(String())
    history = Array(AgentStep())
    user = String()
    streaming = Boolean()  # NEW: Default false
 ```
 **Response Changes:**
 The agent produces multiple types of output during its reasoning cycle:
 - Thoughts (reasoning)
 - Actions (tool calls)
 - Observations (tool results)
 - Answer (final response)
 - Errors
 Since `chunk_type` identifies what kind of content is being sent, the separate
 `answer`, `error`, `thought`, and `observation` fields can be collapsed into
 a single `content` field:
 ```python
 class AgentResponse(Record):
    chunk_type = String()       # "thought", "action", "observation", "answer", "error"
    content = String()          # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()  # Current thought/action/observation/answer is complete
    end_of_dialog = Boolean()   # Entire agent dialog is complete
 ```
 **Field Semantics:**
 - `chunk_type`: Indicates what type of content is in the `content` field
  - `"thought"`: Agent reasoning/thinking
  - `"action"`: Tool/action being invoked
  - `"observation"`: Result from tool execution
  - `"answer"`: Final answer to the user's question
  - `"error"`: Error message
 - `content`: The actual streamed content, interpreted based on `chunk_type`
 - `end_of_message`: When `true`, the current chunk type is complete
  - Example: All tokens for the current thought have been sent
  - Allows clients to know when to move to the next stage
 - `end_of_dialog`: When `true`, the entire agent interaction is complete
  - This is the final message in the stream
 #### Agent Streaming Behavior
 When `streaming=true`:
 1. **Thought streaming**:
   - Multiple chunks with `chunk_type="thought"`, `end_of_message=false`
   - Final thought chunk has `end_of_message=true`
 2. **Action notification**:
   - Single chunk with `chunk_type="action"`, `end_of_message=true`
 3. **Observation**:
   - Chunk(s) with `chunk_type="observation"`, final has `end_of_message=true`
 4. **Repeat** steps 1-3 as the agent reasons
 5. **Final answer**:
   - `chunk_type="answer"` with the final response in `content`
   - Last chunk has `end_of_message=true`, `end_of_dialog=true`
 **Example Stream Sequence:**
 ```
 {chunk_type: "thought", content: "I need to", end_of_message: false, end_of_dialog: false}
 {chunk_type: "thought", content: " search for...", end_of_message: true, end_of_dialog: false}
 {chunk_type: "action", content: "search", end_of_message: true, end_of_dialog: false}
 {chunk_type: "observation", content: "Found: ...", end_of_message: true, end_of_dialog: false}
 {chunk_type: "thought", content: "Based on this", end_of_message: false, end_of_dialog: false}
 {chunk_type: "thought", content: " I can answer...", end_of_message: true, end_of_dialog: false}
 {chunk_type: "answer", content: "The answer is...", end_of_message: true, end_of_dialog: true}
 ```
 When `streaming=false`:
 - Current behavior preserved
 - Single response with complete answer
 - `end_of_message=true`, `end_of_dialog=true`
 #### Gateway and Python API
 - Gateway: New SSE/WebSocket endpoint for agent streaming
 - Python API: New `agent_stream()` async generator method
 ---
 ## Security Considerations
 - **No new attack surface**: Streaming uses same authentication/authorization
 - **Rate limiting**: Apply per-token or per-chunk rate limits if needed
 - **Connection handling**: Properly terminate streams on client disconnect
 - **Timeout management**: Streaming requests need appropriate timeout handling
 ## Performance Considerations
 - **Memory**: Streaming reduces peak memory usage (no full response buffering)
 - **Latency**: Time-to-first-token significantly reduced
 - **Connection overhead**: SSE/WebSocket connections have keep-alive overhead
 - **Pulsar throughput**: Multiple small messages vs. single large message
  tradeoff
 ## Testing Strategy
 ### Unit Tests
 - Schema serialization/deserialization with new fields
 - Backward compatibility (missing fields use defaults)
 - Chunk assembly logic
 ### Integration Tests
 - Each LLM provider's streaming implementation
 - Gateway API streaming endpoints
 - Python client streaming methods
 ### End-to-End Tests
 - CLI tool streaming output
 - Full flow: Client → Gateway → Pulsar → LLM → back
 - Mixed streaming/non-streaming workloads
 ### Backward Compatibility Tests
 - Existing clients work without modification
 - Non-streaming requests behave identically
 ## Migration Plan
 ### Phase 1: Infrastructure
 - Deploy schema changes (backward compatible)
 - Deploy Gateway API updates
 - Deploy Python API updates
 - Release CLI tool updates
 ### Phase 2: VertexAI
 - Deploy VertexAI streaming implementation
 - Validate with test workloads
 ### Phase 3: All Providers
 - Roll out provider updates incrementally
 - Monitor for issues
 ### Phase 4: Agent API
 - Deploy agent schema changes
 - Deploy agent streaming implementation
 - Update documentation
 ## Timeline
 | Phase | Description | Dependencies |
 |-------|-------------|--------------|
 | Phase 1 | Infrastructure | None |
 | Phase 2 | VertexAI PoC | Phase 1 |
 | Phase 3 | All Providers | Phase 2 |
 | Phase 4 | Agent API | Phase 3 |
 ## Design Decisions
 The following questions were resolved during specification:
 1. **Token Counts in Streaming**: Token counts are deltas, not running totals.
   Consumers can sum them if needed. This matches how most providers report
   usage and simplifies the implementation.
 2. **Error Handling in Streams**: If an error occurs, the `error` field is
   populated and no other fields are needed. An error is always the final
   communication - no subsequent messages are permitted or expected after
   an error. For LLM/Prompt streams, `end_of_stream=true`. For Agent streams,
   `chunk_type="error"` with `end_of_dialog=true`.
 3. **Partial Response Recovery**: The messaging protocol (Pulsar) is resilient,
   so message-level retry is not needed. If a client loses track of the stream
   or disconnects, it must retry the full request from scratch.
 4. **Prompt Service Streaming**: Streaming is only supported for text (`text`)
   responses, not structured (`object`) responses. The prompt service knows at
   the outset whether the output will be JSON or text based on the prompt
   template. If a streaming request is made for a JSON-output prompt, the
   service should either:
   - Return the complete JSON in a single response with `end_of_stream=true`, or
   - Reject the streaming request with an error
 ## Open Questions
 None at this time.
 ## References
 - Current LLM schema: `trustgraph-base/trustgraph/schema/services/llm.py`
 - Current prompt schema: `trustgraph-base/trustgraph/schema/services/prompt.py`
 - Current agent schema: `trustgraph-base/trustgraph/schema/services/agent.py`
 - LLM service base: `trustgraph-base/trustgraph/base/llm_service.py`
 - VertexAI provider: `trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py`
 - Gateway API: `trustgraph-base/trustgraph/api/`
 - CLI tools: `trustgraph-cli/trustgraph/cli/`
--- a/tests/integration/conftest.py
+++ b/tests/integration/conftest.py
@ -382,6 +382,206 @@ def sample_kg_triples():
    ]
 # Streaming test fixtures
@pytest.fixture
 def mock_streaming_llm_response():
    """Mock streaming LLM response with realistic chunks"""
    async def _generate_chunks():
        """Generate realistic streaming chunks"""
        chunks = [
            "Machine",
            " learning",
            " is",
            " a",
            " subset",
            " of",
            " artificial",
            " intelligence",
            " that",
            " focuses",
            " on",
            " algorithms",
            " that",
            " learn",
            " from",
            " data",
            "."
        ]
        for chunk in chunks:
            yield chunk
    return _generate_chunks
@pytest.fixture
 def sample_streaming_agent_response():
    """Sample streaming agent response chunks"""
    return [
        {
            "chunk_type": "thought",
            "content": "I need to search",
            "end_of_message": False,
            "end_of_dialog": False
        },
        {
            "chunk_type": "thought",
            "content": " for information",
            "end_of_message": False,
            "end_of_dialog": False
        },
        {
            "chunk_type": "thought",
            "content": " about machine learning.",
            "end_of_message": True,
            "end_of_dialog": False
        },
        {
            "chunk_type": "action",
            "content": "knowledge_query",
            "end_of_message": True,
            "end_of_dialog": False
        },
        {
            "chunk_type": "observation",
            "content": "Machine learning is",
            "end_of_message": False,
            "end_of_dialog": False
        },
        {
            "chunk_type": "observation",
            "content": " a subset of AI.",
            "end_of_message": True,
            "end_of_dialog": False
        },
        {
            "chunk_type": "final-answer",
            "content": "Machine learning",
            "end_of_message": False,
            "end_of_dialog": False
        },
        {
            "chunk_type": "final-answer",
            "content": " is a subset",
            "end_of_message": False,
            "end_of_dialog": False
        },
        {
            "chunk_type": "final-answer",
            "content": " of artificial intelligence.",
            "end_of_message": True,
            "end_of_dialog": True
        }
    ]
@pytest.fixture
 def streaming_chunk_collector():
    """Helper to collect streaming chunks for assertions"""
    class ChunkCollector:
        def __init__(self):
            self.chunks = []
            self.complete = False
        async def collect(self, chunk):
            """Async callback to collect chunks"""
            self.chunks.append(chunk)
        def get_full_text(self):
            """Concatenate all chunk content"""
            return "".join(self.chunks)
        def get_chunk_types(self):
            """Get list of chunk types if chunks are dicts"""
            if self.chunks and isinstance(self.chunks[0], dict):
                return [c.get("chunk_type") for c in self.chunks]
            return []
    return ChunkCollector
@pytest.fixture
 def mock_streaming_prompt_response():
    """Mock streaming prompt service response"""
    async def _generate_prompt_chunks():
        """Generate streaming chunks for prompt responses"""
        chunks = [
            "Based on the",
            " provided context,",
            " here is",
            " the answer:",
            " Machine learning",
            " enables computers",
            " to learn",
            " from data."
        ]
        for chunk in chunks:
            yield chunk
    return _generate_prompt_chunks
@pytest.fixture
 def sample_rag_streaming_chunks():
    """Sample RAG streaming response chunks"""
    return [
        {
            "chunk": "Based on",
            "end_of_stream": False
        },
        {
            "chunk": " the knowledge",
            "end_of_stream": False
        },
        {
            "chunk": " graph,",
            "end_of_stream": False
        },
        {
            "chunk": " machine learning",
            "end_of_stream": False
        },
        {
            "chunk": " is a subset",
            "end_of_stream": False
        },
        {
            "chunk": " of AI.",
            "end_of_stream": False
        },
        {
            "chunk": None,
            "end_of_stream": True,
            "response": "Based on the knowledge graph, machine learning is a subset of AI."
        }
    ]
@pytest.fixture
 def streaming_error_scenarios():
    """Common error scenarios for streaming tests"""
    return {
        "connection_drop": {
            "exception": ConnectionError,
            "message": "Connection lost during streaming",
            "chunks_before_error": 5
        },
        "timeout": {
            "exception": TimeoutError,
            "message": "Streaming timeout exceeded",
            "chunks_before_error": 10
        },
        "rate_limit": {
            "exception": Exception,
            "message": "Rate limit exceeded",
            "chunks_before_error": 3
        },
        "invalid_chunk": {
            "exception": ValueError,
            "message": "Invalid chunk format",
            "chunks_before_error": 7
        }
    }
 # Test markers for integration tests
 pytestmark = pytest.mark.integration
--- a/tests/integration/test_agent_manager_integration.py
+++ b/tests/integration/test_agent_manager_integration.py
@ -135,10 +135,10 @@ Args: {
        # Verify prompt client was called correctly
        prompt_client = mock_flow_context("prompt-request")
        prompt_client.agent_react.assert_called_once()
-        
+
        # Verify the prompt variables passed to agent_react
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        assert variables["question"] == question
        assert len(variables["tools"]) == 3  # knowledge_query, text_completion, web_search
        assert variables["context"] == "You are a helpful AI assistant with access to knowledge and tools."
@ -182,8 +182,8 @@ Final Answer: Machine learning is a field of AI that enables computers to learn
        assert action.observation == "Machine learning is a subset of AI that enables computers to learn from data."
        # Verify callbacks were called
-        think_callback.assert_called_once_with("I need to search for information about machine learning")
+        think_callback.assert_called_once_with("I need to search for information about machine learning", is_final=True)
-        observe_callback.assert_called_once_with("Machine learning is a subset of AI that enables computers to learn from data.")
+        observe_callback.assert_called_once_with("Machine learning is a subset of AI that enables computers to learn from data.", is_final=True)
        # Verify tool was executed
        graph_rag_client = mock_flow_context("graph-rag-request")
@ -211,7 +211,7 @@ Final Answer: Machine learning is a branch of artificial intelligence."""
        assert action.final == "Machine learning is a branch of artificial intelligence."
        # Verify only think callback was called (no observation for final answer)
-        think_callback.assert_called_once_with("I can provide a direct answer")
+        think_callback.assert_called_once_with("I can provide a direct answer", is_final=True)
        observe_callback.assert_not_called()
    @pytest.mark.asyncio
@ -237,7 +237,7 @@ Final Answer: Machine learning is a branch of artificial intelligence."""
        # Verify history was included in prompt variables
        prompt_client = mock_flow_context("prompt-request")
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        assert len(variables["history"]) == 1
        assert variables["history"][0]["thought"] == "I need to search for information about machine learning"
        assert variables["history"][0]["action"] == "knowledge_query"
@ -337,7 +337,7 @@ Args: {
        # Verify tool information was passed to prompt
        prompt_client = mock_flow_context("prompt-request")
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        # Should have all 3 tools available
        tool_names = [tool["name"] for tool in variables["tools"]]
@ -408,7 +408,7 @@ Args: {args_json}"""
        # Assert
        prompt_client = mock_flow_context("prompt-request")
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        assert variables["context"] == "You are an expert in machine learning research."
        assert variables["question"] == question
@ -427,7 +427,7 @@ Args: {args_json}"""
        # Assert
        prompt_client = mock_flow_context("prompt-request")
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        assert len(variables["tools"]) == 0
        assert variables["tool_names"] == ""
@ -457,7 +457,7 @@ Args: {args_json}"""
            # Assert
            assert isinstance(action, Action)
            assert action.observation == expected_response.strip()
-            observe_callback.assert_called_with(expected_response.strip())
+            observe_callback.assert_called_with(expected_response.strip(), is_final=True)
            # Reset mocks
            mock_flow_context("graph-rag-request").reset_mock()
@ -682,7 +682,7 @@ Final Answer: {
        # Verify history was processed correctly
        prompt_client = mock_flow_context("prompt-request")
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        assert len(variables["history"]) == 50
    @pytest.mark.asyncio
@ -709,7 +709,7 @@ Final Answer: {
        # Verify JSON was properly serialized in prompt
        prompt_client = mock_flow_context("prompt-request")
        call_args = prompt_client.agent_react.call_args
-        variables = call_args[0][0]
+        variables = call_args.kwargs['variables']
        # Should not raise JSON serialization errors
        json_str = json.dumps(variables, indent=4)
--- a/tests/integration/test_agent_streaming_integration.py
+++ b/tests/integration/test_agent_streaming_integration.py
@ -0,0 +1,395 @@
 """
 Integration tests for Agent Manager Streaming Functionality
 These tests verify the streaming behavior of the Agent service, testing
 chunk-by-chunk delivery of thoughts, actions, observations, and final answers.
 """
 import pytest
 from unittest.mock import AsyncMock, MagicMock
 from trustgraph.agent.react.agent_manager import AgentManager
 from trustgraph.agent.react.tools import KnowledgeQueryImpl
 from trustgraph.agent.react.types import Tool, Argument
 from tests.utils.streaming_assertions import (
    assert_agent_streaming_chunks,
    assert_streaming_chunks_valid,
    assert_callback_invoked,
    assert_chunk_types_valid,
 )
@pytest.mark.integration
 class TestAgentStreaming:
    """Integration tests for Agent streaming functionality"""
    @pytest.fixture
    def mock_prompt_client_streaming(self):
        """Mock prompt client with streaming support"""
        client = AsyncMock()
        async def agent_react_streaming(variables, timeout=600, streaming=False, chunk_callback=None):
            # Both modes return the same text for equivalence
            full_text = """Thought: I need to search for information about machine learning.
 Action: knowledge_query
 Args: {
    "question": "What is machine learning?"
 }"""
            if streaming and chunk_callback:
                # Send realistic line-by-line chunks
                # This tests that the parser properly handles "Args:" starting a new chunk
                # (which previously caused a bug where action_buffer was overwritten)
                chunks = [
                    "Thought: I need to search for information about machine learning.\n",
                    "Action: knowledge_query\n",
                    "Args: {\n",  # This used to trigger bug - Args: at start of chunk
                    '    "question": "What is machine learning?"\n',
                    "}"
                ]
                for chunk in chunks:
                    await chunk_callback(chunk)
                return full_text
            else:
                # Non-streaming response - same text
                return full_text
        client.agent_react.side_effect = agent_react_streaming
        return client
    @pytest.fixture
    def mock_flow_context(self, mock_prompt_client_streaming):
        """Mock flow context with streaming prompt client"""
        context = MagicMock()
        # Mock graph RAG client
        graph_rag_client = AsyncMock()
        graph_rag_client.rag.return_value = "Machine learning is a subset of AI."
        def context_router(service_name):
            if service_name == "prompt-request":
                return mock_prompt_client_streaming
            elif service_name == "graph-rag-request":
                return graph_rag_client
            else:
                return AsyncMock()
        context.side_effect = context_router
        return context
    @pytest.fixture
    def sample_tools(self):
        """Sample tool configuration"""
        return {
            "knowledge_query": Tool(
                name="knowledge_query",
                description="Query the knowledge graph",
                arguments=[
                    Argument(
                        name="question",
                        type="string",
                        description="The question to ask"
                    )
                ],
                implementation=KnowledgeQueryImpl,
                config={}
            )
        }
    @pytest.fixture
    def agent_manager(self, sample_tools):
        """Create AgentManager instance with streaming support"""
        return AgentManager(
            tools=sample_tools,
            additional_context="You are a helpful AI assistant."
        )
    @pytest.mark.asyncio
    async def test_agent_streaming_thought_chunks(self, agent_manager, mock_flow_context):
        """Test that thought chunks are streamed correctly"""
        # Arrange
        thought_chunks = []
        async def think(chunk, is_final=False):
            thought_chunks.append(chunk)
        # Act
        await agent_manager.react(
            question="What is machine learning?",
            history=[],
            think=think,
            observe=AsyncMock(),
            context=mock_flow_context,
            streaming=True
        )
        # Assert
        assert len(thought_chunks) > 0
        assert_streaming_chunks_valid(thought_chunks, min_chunks=1)
        # Verify thought content makes sense
        full_thought = "".join(thought_chunks)
        assert "search" in full_thought.lower() or "information" in full_thought.lower()
    @pytest.mark.asyncio
    async def test_agent_streaming_observation_chunks(self, agent_manager, mock_flow_context):
        """Test that observation chunks are streamed correctly"""
        # Arrange
        observation_chunks = []
        async def observe(chunk, is_final=False):
            observation_chunks.append(chunk)
        # Act
        await agent_manager.react(
            question="What is machine learning?",
            history=[],
            think=AsyncMock(),
            observe=observe,
            context=mock_flow_context,
            streaming=True
        )
        # Assert
        # Note: Observations come from tool execution, which may or may not be streamed
        # depending on the tool implementation
        # For now, verify callback was set up
        assert observe is not None
    @pytest.mark.asyncio
    async def test_agent_streaming_vs_non_streaming(self, agent_manager, mock_flow_context):
        """Test that streaming and non-streaming produce equivalent results"""
        # Arrange
        question = "What is machine learning?"
        history = []
        # Act - Non-streaming
        non_streaming_result = await agent_manager.react(
            question=question,
            history=history,
            think=AsyncMock(),
            observe=AsyncMock(),
            context=mock_flow_context,
            streaming=False
        )
        # Act - Streaming
        thought_chunks = []
        observation_chunks = []
        async def think(chunk, is_final=False):
            thought_chunks.append(chunk)
        async def observe(chunk, is_final=False):
            observation_chunks.append(chunk)
        streaming_result = await agent_manager.react(
            question=question,
            history=history,
            think=think,
            observe=observe,
            context=mock_flow_context,
            streaming=True
        )
        # Assert - Results should be equivalent (or both valid)
        assert non_streaming_result is not None
        assert streaming_result is not None
    @pytest.mark.asyncio
    async def test_agent_streaming_callback_invocation(self, agent_manager, mock_flow_context):
        """Test that callbacks are invoked with correct parameters"""
        # Arrange
        think = AsyncMock()
        observe = AsyncMock()
        # Act
        await agent_manager.react(
            question="What is machine learning?",
            history=[],
            think=think,
            observe=observe,
            context=mock_flow_context,
            streaming=True
        )
        # Assert - Think callback should be invoked
        assert think.call_count > 0
        # Verify all callback invocations had string arguments
        for call in think.call_args_list:
            assert len(call.args) > 0
            assert isinstance(call.args[0], str)
    @pytest.mark.asyncio
    async def test_agent_streaming_without_callbacks(self, agent_manager, mock_flow_context):
        """Test streaming parameter without callbacks (should work gracefully)"""
        # Arrange & Act
        result = await agent_manager.react(
            question="What is machine learning?",
            history=[],
            think=AsyncMock(),
            observe=AsyncMock(),
            context=mock_flow_context,
            streaming=True  # Streaming enabled with mock callbacks
        )
        # Assert - Should complete without error
        assert result is not None
    @pytest.mark.asyncio
    async def test_agent_streaming_with_conversation_history(self, agent_manager, mock_flow_context):
        """Test streaming with existing conversation history"""
        # Arrange
        # History should be a list of Action objects
        from trustgraph.agent.react.types import Action
        history = [
            Action(
                thought="I need to search for information about machine learning",
                name="knowledge_query",
                arguments={"question": "What is machine learning?"},
                observation="Machine learning is a subset of AI that enables computers to learn from data."
            )
        ]
        think = AsyncMock()
        # Act
        result = await agent_manager.react(
            question="Tell me more about neural networks",
            history=history,
            think=think,
            observe=AsyncMock(),
            context=mock_flow_context,
            streaming=True
        )
        # Assert
        assert result is not None
        assert think.call_count > 0
    @pytest.mark.asyncio
    async def test_agent_streaming_error_propagation(self, agent_manager, mock_flow_context):
        """Test that errors during streaming are properly propagated"""
        # Arrange
        mock_prompt_client = mock_flow_context("prompt-request")
        mock_prompt_client.agent_react.side_effect = Exception("Prompt service error")
        think = AsyncMock()
        observe = AsyncMock()
        # Act & Assert
        with pytest.raises(Exception) as exc_info:
            await agent_manager.react(
                question="test question",
                history=[],
                think=think,
                observe=observe,
                context=mock_flow_context,
                streaming=True
            )
        assert "Prompt service error" in str(exc_info.value)
    @pytest.mark.asyncio
    async def test_agent_streaming_multi_step_reasoning(self, agent_manager, mock_flow_context,
                                                         mock_prompt_client_streaming):
        """Test streaming through multi-step reasoning process"""
        # Arrange - Mock a multi-step response
        step_responses = [
            """Thought: I need to search for basic information.
 Action: knowledge_query
 Args: {"question": "What is AI?"}""",
            """Thought: Now I can answer the question.
 Final Answer: AI is the simulation of human intelligence in machines."""
        ]
        call_count = 0
        async def multi_step_agent_react(variables, timeout=600, streaming=False, chunk_callback=None):
            nonlocal call_count
            response = step_responses[min(call_count, len(step_responses) - 1)]
            call_count += 1
            if streaming and chunk_callback:
                for chunk in response.split():
                    await chunk_callback(chunk + " ")
                return response
            return response
        mock_prompt_client_streaming.agent_react.side_effect = multi_step_agent_react
        think = AsyncMock()
        observe = AsyncMock()
        # Act
        result = await agent_manager.react(
            question="What is artificial intelligence?",
            history=[],
            think=think,
            observe=observe,
            context=mock_flow_context,
            streaming=True
        )
        # Assert
        assert result is not None
        assert think.call_count > 0
    @pytest.mark.asyncio
    async def test_agent_streaming_preserves_tool_config(self, agent_manager, mock_flow_context):
        """Test that streaming preserves tool configuration and context"""
        # Arrange
        think = AsyncMock()
        observe = AsyncMock()
        # Act
        await agent_manager.react(
            question="What is machine learning?",
            history=[],
            think=think,
            observe=observe,
            context=mock_flow_context,
            streaming=True
        )
        # Assert - Verify prompt client was called with streaming
        mock_prompt_client = mock_flow_context("prompt-request")
        call_args = mock_prompt_client.agent_react.call_args
        assert call_args.kwargs['streaming'] is True
        assert call_args.kwargs['chunk_callback'] is not None
    @pytest.mark.asyncio
    async def test_agent_streaming_end_of_message_flags(self, agent_manager, mock_flow_context):
        """Test that end_of_message flags are correctly set for thought chunks"""
        # Arrange
        thought_calls = []
        async def think(chunk, is_final=False):
            thought_calls.append({
                'chunk': chunk,
                'is_final': is_final
            })
        # Act
        await agent_manager.react(
            question="What is machine learning?",
            history=[],
            think=think,
            observe=AsyncMock(),
            context=mock_flow_context,
            streaming=True
        )
        # Assert
        assert len(thought_calls) > 0, "Expected thought chunks to be sent"
        # All chunks except the last should have is_final=False
        for i, call in enumerate(thought_calls[:-1]):
            assert call['is_final'] is False, \
                f"Thought chunk {i} should have is_final=False, got {call['is_final']}"
        # Last chunk should have is_final=True
        last_call = thought_calls[-1]
        assert last_call['is_final'] is True, \
            f"Last thought chunk should have is_final=True, got {last_call['is_final']}"
--- a/tests/integration/test_document_rag_streaming_integration.py
+++ b/tests/integration/test_document_rag_streaming_integration.py
@ -0,0 +1,274 @@
 """
 Integration tests for DocumentRAG streaming functionality
 These tests verify the streaming behavior of DocumentRAG, testing token-by-token
 response delivery through the complete pipeline.
 """
 import pytest
 from unittest.mock import AsyncMock
 from trustgraph.retrieval.document_rag.document_rag import DocumentRag
 from tests.utils.streaming_assertions import (
    assert_streaming_chunks_valid,
    assert_callback_invoked,
 )
@pytest.mark.integration
 class TestDocumentRagStreaming:
    """Integration tests for DocumentRAG streaming"""
    @pytest.fixture
    def mock_embeddings_client(self):
        """Mock embeddings client"""
        client = AsyncMock()
        client.embed.return_value = [[0.1, 0.2, 0.3, 0.4, 0.5]]
        return client
    @pytest.fixture
    def mock_doc_embeddings_client(self):
        """Mock document embeddings client"""
        client = AsyncMock()
        client.query.return_value = [
            "Machine learning is a subset of AI.",
            "Deep learning uses neural networks.",
            "Supervised learning needs labeled data."
        ]
        return client
    @pytest.fixture
    def mock_streaming_prompt_client(self, mock_streaming_llm_response):
        """Mock prompt client with streaming support"""
        client = AsyncMock()
        async def document_prompt_side_effect(query, documents, timeout=600, streaming=False, chunk_callback=None):
            # Both modes return the same text
            full_text = "Machine learning is a subset of artificial intelligence that focuses on algorithms that learn from data."
            if streaming and chunk_callback:
                # Simulate streaming chunks
                async for chunk in mock_streaming_llm_response():
                    await chunk_callback(chunk)
                return full_text
            else:
                # Non-streaming response - same text
                return full_text
        client.document_prompt.side_effect = document_prompt_side_effect
        return client
    @pytest.fixture
    def document_rag_streaming(self, mock_embeddings_client, mock_doc_embeddings_client,
                                mock_streaming_prompt_client):
        """Create DocumentRag instance with streaming support"""
        return DocumentRag(
            embeddings_client=mock_embeddings_client,
            doc_embeddings_client=mock_doc_embeddings_client,
            prompt_client=mock_streaming_prompt_client,
            verbose=True
        )
    @pytest.mark.asyncio
    async def test_document_rag_streaming_basic(self, document_rag_streaming, streaming_chunk_collector):
        """Test basic DocumentRAG streaming functionality"""
        # Arrange
        query = "What is machine learning?"
        collector = streaming_chunk_collector()
        # Act
        result = await document_rag_streaming.query(
            query=query,
            user="test_user",
            collection="test_collection",
            doc_limit=10,
            streaming=True,
            chunk_callback=collector.collect
        )
        # Assert
        assert_streaming_chunks_valid(collector.chunks, min_chunks=1)
        assert_callback_invoked(AsyncMock(call_count=len(collector.chunks)), min_calls=1)
        # Verify full response matches concatenated chunks
        full_from_chunks = collector.get_full_text()
        assert result == full_from_chunks
        # Verify content is reasonable
        assert len(result) > 0
    @pytest.mark.asyncio
    async def test_document_rag_streaming_vs_non_streaming(self, document_rag_streaming):
        """Test that streaming and non-streaming produce equivalent results"""
        # Arrange
        query = "What is machine learning?"
        user = "test_user"
        collection = "test_collection"
        doc_limit = 10
        # Act - Non-streaming
        non_streaming_result = await document_rag_streaming.query(
            query=query,
            user=user,
            collection=collection,
            doc_limit=doc_limit,
            streaming=False
        )
        # Act - Streaming
        streaming_chunks = []
        async def collect(chunk):
            streaming_chunks.append(chunk)
        streaming_result = await document_rag_streaming.query(
            query=query,
            user=user,
            collection=collection,
            doc_limit=doc_limit,
            streaming=True,
            chunk_callback=collect
        )
        # Assert - Results should be equivalent
        assert streaming_result == non_streaming_result
        assert len(streaming_chunks) > 0
        assert "".join(streaming_chunks) == streaming_result
    @pytest.mark.asyncio
    async def test_document_rag_streaming_callback_invocation(self, document_rag_streaming):
        """Test that chunk callback is invoked correctly"""
        # Arrange
        callback = AsyncMock()
        # Act
        result = await document_rag_streaming.query(
            query="test query",
            user="test_user",
            collection="test_collection",
            doc_limit=5,
            streaming=True,
            chunk_callback=callback
        )
        # Assert
        assert callback.call_count > 0
        assert result is not None
        # Verify all callback invocations had string arguments
        for call in callback.call_args_list:
            assert isinstance(call.args[0], str)
    @pytest.mark.asyncio
    async def test_document_rag_streaming_without_callback(self, document_rag_streaming):
        """Test streaming parameter without callback (should fall back to non-streaming)"""
        # Arrange & Act
        result = await document_rag_streaming.query(
            query="test query",
            user="test_user",
            collection="test_collection",
            doc_limit=5,
            streaming=True,
            chunk_callback=None  # No callback provided
        )
        # Assert - Should complete without error
        assert result is not None
        assert isinstance(result, str)
    @pytest.mark.asyncio
    async def test_document_rag_streaming_with_no_documents(self, document_rag_streaming,
                                                             mock_doc_embeddings_client):
        """Test streaming with no documents found"""
        # Arrange
        mock_doc_embeddings_client.query.return_value = []  # No documents
        callback = AsyncMock()
        # Act
        result = await document_rag_streaming.query(
            query="unknown topic",
            user="test_user",
            collection="test_collection",
            doc_limit=10,
            streaming=True,
            chunk_callback=callback
        )
        # Assert - Should still produce streamed response
        assert result is not None
        assert callback.call_count > 0
    @pytest.mark.asyncio
    async def test_document_rag_streaming_error_propagation(self, document_rag_streaming,
                                                              mock_embeddings_client):
        """Test that errors during streaming are properly propagated"""
        # Arrange
        mock_embeddings_client.embed.side_effect = Exception("Embeddings error")
        callback = AsyncMock()
        # Act & Assert
        with pytest.raises(Exception) as exc_info:
            await document_rag_streaming.query(
                query="test query",
                user="test_user",
                collection="test_collection",
                doc_limit=5,
                streaming=True,
                chunk_callback=callback
            )
        assert "Embeddings error" in str(exc_info.value)
    @pytest.mark.asyncio
    async def test_document_rag_streaming_with_different_doc_limits(self, document_rag_streaming,
                                                                      mock_doc_embeddings_client):
        """Test streaming with various document limits"""
        # Arrange
        callback = AsyncMock()
        doc_limits = [1, 5, 10, 20]
        for limit in doc_limits:
            # Reset mocks
            mock_doc_embeddings_client.reset_mock()
            callback.reset_mock()
            # Act
            result = await document_rag_streaming.query(
                query="test query",
                user="test_user",
                collection="test_collection",
                doc_limit=limit,
                streaming=True,
                chunk_callback=callback
            )
            # Assert
            assert result is not None
            assert callback.call_count > 0
            # Verify doc_limit was passed correctly
            call_args = mock_doc_embeddings_client.query.call_args
            assert call_args.kwargs['limit'] == limit
    @pytest.mark.asyncio
    async def test_document_rag_streaming_preserves_user_collection(self, document_rag_streaming,
                                                                      mock_doc_embeddings_client):
        """Test that streaming preserves user/collection isolation"""
        # Arrange
        callback = AsyncMock()
        user = "test_user_123"
        collection = "test_collection_456"
        # Act
        await document_rag_streaming.query(
            query="test query",
            user=user,
            collection=collection,
            doc_limit=10,
            streaming=True,
            chunk_callback=callback
        )
        # Assert - Verify user/collection were passed to document embeddings client
        call_args = mock_doc_embeddings_client.query.call_args
        assert call_args.kwargs['user'] == user
        assert call_args.kwargs['collection'] == collection
--- a/tests/integration/test_graph_rag_integration.py
+++ b/tests/integration/test_graph_rag_integration.py
@ -0,0 +1,269 @@
 """
 Integration tests for GraphRAG retrieval system
 These tests verify the end-to-end functionality of the GraphRAG system,
 testing the coordination between embeddings, graph retrieval, triple querying, and prompt services.
 Following the TEST_STRATEGY.md approach for integration testing.
 NOTE: This is the first integration test file for GraphRAG (previously had only unit tests).
 """
 import pytest
 from unittest.mock import AsyncMock, MagicMock
 from trustgraph.retrieval.graph_rag.graph_rag import GraphRag
@pytest.mark.integration
 class TestGraphRagIntegration:
    """Integration tests for GraphRAG system coordination"""
    @pytest.fixture
    def mock_embeddings_client(self):
        """Mock embeddings client that returns realistic vector embeddings"""
        client = AsyncMock()
        client.embed.return_value = [
            [0.1, 0.2, 0.3, 0.4, 0.5],  # Realistic 5-dimensional embedding
        ]
        return client
    @pytest.fixture
    def mock_graph_embeddings_client(self):
        """Mock graph embeddings client that returns realistic entities"""
        client = AsyncMock()
        client.query.return_value = [
            "http://trustgraph.ai/e/machine-learning",
            "http://trustgraph.ai/e/artificial-intelligence",
            "http://trustgraph.ai/e/neural-networks"
        ]
        return client
    @pytest.fixture
    def mock_triples_client(self):
        """Mock triples client that returns realistic knowledge graph triples"""
        client = AsyncMock()
        # Mock different queries return different triples
        async def query_side_effect(s=None, p=None, o=None, limit=None, user=None, collection=None):
            # Mock label queries
            if p == "http://www.w3.org/2000/01/rdf-schema#label":
                if s == "http://trustgraph.ai/e/machine-learning":
                    return [MagicMock(s=s, p=p, o="Machine Learning")]
                elif s == "http://trustgraph.ai/e/artificial-intelligence":
                    return [MagicMock(s=s, p=p, o="Artificial Intelligence")]
                elif s == "http://trustgraph.ai/e/neural-networks":
                    return [MagicMock(s=s, p=p, o="Neural Networks")]
                return []
            # Mock relationship queries
            if s == "http://trustgraph.ai/e/machine-learning":
                return [
                    MagicMock(
                        s="http://trustgraph.ai/e/machine-learning",
                        p="http://trustgraph.ai/is_subset_of",
                        o="http://trustgraph.ai/e/artificial-intelligence"
                    ),
                    MagicMock(
                        s="http://trustgraph.ai/e/machine-learning",
                        p="http://www.w3.org/2000/01/rdf-schema#label",
                        o="Machine Learning"
                    )
                ]
            return []
        client.query.side_effect = query_side_effect
        return client
    @pytest.fixture
    def mock_prompt_client(self):
        """Mock prompt client that generates realistic responses"""
        client = AsyncMock()
        client.kg_prompt.return_value = (
            "Machine learning is a subset of artificial intelligence that enables computers "
            "to learn from data without being explicitly programmed. It uses algorithms "
            "and statistical models to find patterns in data."
        )
        return client
    @pytest.fixture
    def graph_rag(self, mock_embeddings_client, mock_graph_embeddings_client,
                  mock_triples_client, mock_prompt_client):
        """Create GraphRag instance with mocked dependencies"""
        return GraphRag(
            embeddings_client=mock_embeddings_client,
            graph_embeddings_client=mock_graph_embeddings_client,
            triples_client=mock_triples_client,
            prompt_client=mock_prompt_client,
            verbose=True
        )
    @pytest.mark.asyncio
    async def test_graph_rag_end_to_end_flow(self, graph_rag, mock_embeddings_client,
                                             mock_graph_embeddings_client, mock_triples_client,
                                             mock_prompt_client):
        """Test complete GraphRAG pipeline from query to response"""
        # Arrange
        query = "What is machine learning?"
        user = "test_user"
        collection = "ml_knowledge"
        entity_limit = 50
        triple_limit = 30
        # Act
        result = await graph_rag.query(
            query=query,
            user=user,
            collection=collection,
            entity_limit=entity_limit,
            triple_limit=triple_limit
        )
        # Assert - Verify service coordination
        # 1. Should compute embeddings for query
        mock_embeddings_client.embed.assert_called_once_with(query)
        # 2. Should query graph embeddings to find relevant entities
        mock_graph_embeddings_client.query.assert_called_once()
        call_args = mock_graph_embeddings_client.query.call_args
        assert call_args.kwargs['vectors'] == [[0.1, 0.2, 0.3, 0.4, 0.5]]
        assert call_args.kwargs['limit'] == entity_limit
        assert call_args.kwargs['user'] == user
        assert call_args.kwargs['collection'] == collection
        # 3. Should query triples to build knowledge subgraph
        assert mock_triples_client.query.call_count > 0
        # 4. Should call prompt with knowledge graph
        mock_prompt_client.kg_prompt.assert_called_once()
        call_args = mock_prompt_client.kg_prompt.call_args
        assert call_args.args[0] == query  # First arg is query
        assert isinstance(call_args.args[1], list)  # Second arg is kg (list of triples)
        # Verify final response
        assert result is not None
        assert isinstance(result, str)
        assert "machine learning" in result.lower()
    @pytest.mark.asyncio
    async def test_graph_rag_with_different_limits(self, graph_rag, mock_embeddings_client,
                                                    mock_graph_embeddings_client):
        """Test GraphRAG with various entity and triple limits"""
        # Arrange
        query = "Explain neural networks"
        test_configs = [
            {"entity_limit": 10, "triple_limit": 10},
            {"entity_limit": 50, "triple_limit": 30},
            {"entity_limit": 100, "triple_limit": 100},
        ]
        for config in test_configs:
            # Reset mocks
            mock_embeddings_client.reset_mock()
            mock_graph_embeddings_client.reset_mock()
            # Act
            await graph_rag.query(
                query=query,
                user="test_user",
                collection="test_collection",
                entity_limit=config["entity_limit"],
                triple_limit=config["triple_limit"]
            )
            # Assert
            call_args = mock_graph_embeddings_client.query.call_args
            assert call_args.kwargs['limit'] == config["entity_limit"]
    @pytest.mark.asyncio
    async def test_graph_rag_error_propagation(self, graph_rag, mock_embeddings_client):
        """Test that errors from underlying services are properly propagated"""
        # Arrange
        mock_embeddings_client.embed.side_effect = Exception("Embeddings service error")
        # Act & Assert
        with pytest.raises(Exception) as exc_info:
            await graph_rag.query(
                query="test query",
                user="test_user",
                collection="test_collection"
            )
        assert "Embeddings service error" in str(exc_info.value)
    @pytest.mark.asyncio
    async def test_graph_rag_with_empty_knowledge_graph(self, graph_rag, mock_graph_embeddings_client,
                                                         mock_triples_client, mock_prompt_client):
        """Test GraphRAG handles empty knowledge graph gracefully"""
        # Arrange
        mock_graph_embeddings_client.query.return_value = []  # No entities found
        mock_triples_client.query.return_value = []  # No triples found
        # Act
        result = await graph_rag.query(
            query="unknown topic",
            user="test_user",
            collection="test_collection"
        )
        # Assert
        # Should still call prompt client with empty knowledge graph
        mock_prompt_client.kg_prompt.assert_called_once()
        call_args = mock_prompt_client.kg_prompt.call_args
        assert isinstance(call_args.args[1], list)  # kg should be a list
        assert result is not None
    @pytest.mark.asyncio
    async def test_graph_rag_label_caching(self, graph_rag, mock_triples_client):
        """Test that label lookups are cached to reduce redundant queries"""
        # Arrange
        query = "What is machine learning?"
        # First query
        await graph_rag.query(
            query=query,
            user="test_user",
            collection="test_collection"
        )
        first_call_count = mock_triples_client.query.call_count
        mock_triples_client.reset_mock()
        # Second identical query
        await graph_rag.query(
            query=query,
            user="test_user",
            collection="test_collection"
        )
        second_call_count = mock_triples_client.query.call_count
        # Assert - Second query should make fewer triple queries due to caching
        # Note: This is a weak assertion because caching behavior depends on
        # implementation details, but it verifies the concept
        assert second_call_count >= 0  # Should complete without errors
    @pytest.mark.asyncio
    async def test_graph_rag_multi_user_isolation(self, graph_rag, mock_graph_embeddings_client):
        """Test that different users/collections are properly isolated"""
        # Arrange
        query = "test query"
        user1, collection1 = "user1", "collection1"
        user2, collection2 = "user2", "collection2"
        # Act
        await graph_rag.query(query=query, user=user1, collection=collection1)
        await graph_rag.query(query=query, user=user2, collection=collection2)
        # Assert - Both users should have separate queries
        assert mock_graph_embeddings_client.query.call_count == 2
        # Verify first call
        first_call = mock_graph_embeddings_client.query.call_args_list[0]
        assert first_call.kwargs['user'] == user1
        assert first_call.kwargs['collection'] == collection1
        # Verify second call
        second_call = mock_graph_embeddings_client.query.call_args_list[1]
        assert second_call.kwargs['user'] == user2
        assert second_call.kwargs['collection'] == collection2
--- a/tests/integration/test_graph_rag_streaming_integration.py
+++ b/tests/integration/test_graph_rag_streaming_integration.py
@ -0,0 +1,249 @@
 """
 Integration tests for GraphRAG streaming functionality
 These tests verify the streaming behavior of GraphRAG, testing token-by-token
 response delivery through the complete pipeline.
 """
 import pytest
 from unittest.mock import AsyncMock, MagicMock
 from trustgraph.retrieval.graph_rag.graph_rag import GraphRag
 from tests.utils.streaming_assertions import (
    assert_streaming_chunks_valid,
    assert_rag_streaming_chunks,
    assert_streaming_content_matches,
    assert_callback_invoked,
 )
@pytest.mark.integration
 class TestGraphRagStreaming:
    """Integration tests for GraphRAG streaming"""
    @pytest.fixture
    def mock_embeddings_client(self):
        """Mock embeddings client"""
        client = AsyncMock()
        client.embed.return_value = [[0.1, 0.2, 0.3, 0.4, 0.5]]
        return client
    @pytest.fixture
    def mock_graph_embeddings_client(self):
        """Mock graph embeddings client"""
        client = AsyncMock()
        client.query.return_value = [
            "http://trustgraph.ai/e/machine-learning",
        ]
        return client
    @pytest.fixture
    def mock_triples_client(self):
        """Mock triples client with minimal responses"""
        client = AsyncMock()
        async def query_side_effect(s=None, p=None, o=None, limit=None, user=None, collection=None):
            if p == "http://www.w3.org/2000/01/rdf-schema#label":
                return [MagicMock(s=s, p=p, o="Machine Learning")]
            return []
        client.query.side_effect = query_side_effect
        return client
    @pytest.fixture
    def mock_streaming_prompt_client(self, mock_streaming_llm_response):
        """Mock prompt client with streaming support"""
        client = AsyncMock()
        async def kg_prompt_side_effect(query, kg, timeout=600, streaming=False, chunk_callback=None):
            # Both modes return the same text
            full_text = "Machine learning is a subset of artificial intelligence that focuses on algorithms that learn from data."
            if streaming and chunk_callback:
                # Simulate streaming chunks
                async for chunk in mock_streaming_llm_response():
                    await chunk_callback(chunk)
                return full_text
            else:
                # Non-streaming response - same text
                return full_text
        client.kg_prompt.side_effect = kg_prompt_side_effect
        return client
    @pytest.fixture
    def graph_rag_streaming(self, mock_embeddings_client, mock_graph_embeddings_client,
                            mock_triples_client, mock_streaming_prompt_client):
        """Create GraphRag instance with streaming support"""
        return GraphRag(
            embeddings_client=mock_embeddings_client,
            graph_embeddings_client=mock_graph_embeddings_client,
            triples_client=mock_triples_client,
            prompt_client=mock_streaming_prompt_client,
            verbose=True
        )
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_basic(self, graph_rag_streaming, streaming_chunk_collector):
        """Test basic GraphRAG streaming functionality"""
        # Arrange
        query = "What is machine learning?"
        collector = streaming_chunk_collector()
        # Act
        result = await graph_rag_streaming.query(
            query=query,
            user="test_user",
            collection="test_collection",
            streaming=True,
            chunk_callback=collector.collect
        )
        # Assert
        assert_streaming_chunks_valid(collector.chunks, min_chunks=1)
        assert_callback_invoked(AsyncMock(call_count=len(collector.chunks)), min_calls=1)
        # Verify full response matches concatenated chunks
        full_from_chunks = collector.get_full_text()
        assert result == full_from_chunks
        # Verify content is reasonable
        assert "machine" in result.lower() or "learning" in result.lower()
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_vs_non_streaming(self, graph_rag_streaming):
        """Test that streaming and non-streaming produce equivalent results"""
        # Arrange
        query = "What is machine learning?"
        user = "test_user"
        collection = "test_collection"
        # Act - Non-streaming
        non_streaming_result = await graph_rag_streaming.query(
            query=query,
            user=user,
            collection=collection,
            streaming=False
        )
        # Act - Streaming
        streaming_chunks = []
        async def collect(chunk):
            streaming_chunks.append(chunk)
        streaming_result = await graph_rag_streaming.query(
            query=query,
            user=user,
            collection=collection,
            streaming=True,
            chunk_callback=collect
        )
        # Assert - Results should be equivalent
        assert streaming_result == non_streaming_result
        assert len(streaming_chunks) > 0
        assert "".join(streaming_chunks) == streaming_result
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_callback_invocation(self, graph_rag_streaming):
        """Test that chunk callback is invoked correctly"""
        # Arrange
        callback = AsyncMock()
        # Act
        result = await graph_rag_streaming.query(
            query="test query",
            user="test_user",
            collection="test_collection",
            streaming=True,
            chunk_callback=callback
        )
        # Assert
        assert callback.call_count > 0
        assert result is not None
        # Verify all callback invocations had string arguments
        for call in callback.call_args_list:
            assert isinstance(call.args[0], str)
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_without_callback(self, graph_rag_streaming):
        """Test streaming parameter without callback (should fall back to non-streaming)"""
        # Arrange & Act
        result = await graph_rag_streaming.query(
            query="test query",
            user="test_user",
            collection="test_collection",
            streaming=True,
            chunk_callback=None  # No callback provided
        )
        # Assert - Should complete without error
        assert result is not None
        assert isinstance(result, str)
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_with_empty_kg(self, graph_rag_streaming,
                                                      mock_graph_embeddings_client):
        """Test streaming with empty knowledge graph"""
        # Arrange
        mock_graph_embeddings_client.query.return_value = []  # No entities
        callback = AsyncMock()
        # Act
        result = await graph_rag_streaming.query(
            query="unknown topic",
            user="test_user",
            collection="test_collection",
            streaming=True,
            chunk_callback=callback
        )
        # Assert - Should still produce streamed response
        assert result is not None
        assert callback.call_count > 0
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_error_propagation(self, graph_rag_streaming,
                                                          mock_embeddings_client):
        """Test that errors during streaming are properly propagated"""
        # Arrange
        mock_embeddings_client.embed.side_effect = Exception("Embeddings error")
        callback = AsyncMock()
        # Act & Assert
        with pytest.raises(Exception) as exc_info:
            await graph_rag_streaming.query(
                query="test query",
                user="test_user",
                collection="test_collection",
                streaming=True,
                chunk_callback=callback
            )
        assert "Embeddings error" in str(exc_info.value)
    @pytest.mark.asyncio
    async def test_graph_rag_streaming_preserves_parameters(self, graph_rag_streaming,
                                                             mock_graph_embeddings_client):
        """Test that streaming preserves all query parameters"""
        # Arrange
        callback = AsyncMock()
        entity_limit = 25
        triple_limit = 15
        # Act
        await graph_rag_streaming.query(
            query="test query",
            user="test_user",
            collection="test_collection",
            entity_limit=entity_limit,
            triple_limit=triple_limit,
            streaming=True,
            chunk_callback=callback
        )
        # Assert - Verify parameters were passed to underlying services
        call_args = mock_graph_embeddings_client.query.call_args
        assert call_args.kwargs['limit'] == entity_limit
--- a/tests/integration/test_prompt_streaming_integration.py
+++ b/tests/integration/test_prompt_streaming_integration.py
@ -0,0 +1,404 @@
 """
 Integration tests for Prompt Service Streaming Functionality
 These tests verify the streaming behavior of the Prompt service,
 testing how it coordinates between templates and text completion streaming.
 """
 import pytest
 from unittest.mock import AsyncMock, MagicMock
 from trustgraph.prompt.template.service import Processor
 from trustgraph.schema import PromptRequest, PromptResponse, TextCompletionResponse
 from tests.utils.streaming_assertions import (
    assert_streaming_chunks_valid,
    assert_callback_invoked,
 )
@pytest.mark.integration
 class TestPromptStreaming:
    """Integration tests for Prompt service streaming"""
    @pytest.fixture
    def mock_flow_context_streaming(self):
        """Mock flow context with streaming text completion support"""
        context = MagicMock()
        # Mock text completion client with streaming
        text_completion_client = AsyncMock()
        async def streaming_request(request, recipient=None, timeout=600):
            """Simulate streaming text completion"""
            if request.streaming and recipient:
                # Simulate streaming chunks
                chunks = [
                    "Machine", " learning", " is", " a", " field",
                    " of", " artificial", " intelligence", "."
                ]
                for i, chunk_text in enumerate(chunks):
                    is_final = (i == len(chunks) - 1)
                    response = TextCompletionResponse(
                        response=chunk_text,
                        error=None,
                        end_of_stream=is_final
                    )
                    final = await recipient(response)
                    if final:
                        break
                # Final empty chunk
                await recipient(TextCompletionResponse(
                    response="",
                    error=None,
                    end_of_stream=True
                ))
        text_completion_client.request = streaming_request
        # Mock response producer
        response_producer = AsyncMock()
        def context_router(service_name):
            if service_name == "text-completion-request":
                return text_completion_client
            elif service_name == "response":
                return response_producer
            else:
                return AsyncMock()
        context.side_effect = context_router
        return context
    @pytest.fixture
    def mock_prompt_manager(self):
        """Mock PromptManager with simple template"""
        manager = MagicMock()
        async def invoke_template(kind, input_vars, llm_function):
            """Simulate template invocation"""
            # Call the LLM function with simple prompts
            system = "You are a helpful assistant."
            prompt = f"Question: {input_vars.get('question', 'test')}"
            result = await llm_function(system, prompt)
            return result
        manager.invoke = invoke_template
        return manager
    @pytest.fixture
    def prompt_processor_streaming(self, mock_prompt_manager):
        """Create Prompt processor with streaming support"""
        processor = MagicMock()
        processor.manager = mock_prompt_manager
        processor.config_key = "prompt"
        # Bind the actual on_request method
        processor.on_request = Processor.on_request.__get__(processor, Processor)
        return processor
    @pytest.mark.asyncio
    async def test_prompt_streaming_basic(self, prompt_processor_streaming, mock_flow_context_streaming):
        """Test basic prompt streaming functionality"""
        # Arrange
        request = PromptRequest(
            id="kg_prompt",
            terms={"question": '"What is machine learning?"'},
            streaming=True
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": "test-123"}
        consumer = MagicMock()
        # Act
        await prompt_processor_streaming.on_request(
            message, consumer, mock_flow_context_streaming
        )
        # Assert
        # Verify response producer was called multiple times (for streaming chunks)
        response_producer = mock_flow_context_streaming("response")
        assert response_producer.send.call_count > 0
        # Verify streaming chunks were sent
        calls = response_producer.send.call_args_list
        assert len(calls) > 1  # Should have multiple chunks
        # Check that responses have end_of_stream flag
        for call in calls:
            response = call.args[0]
            assert isinstance(response, PromptResponse)
            assert hasattr(response, 'end_of_stream')
        # Last response should have end_of_stream=True
        last_call = calls[-1]
        last_response = last_call.args[0]
        assert last_response.end_of_stream is True
    @pytest.mark.asyncio
    async def test_prompt_streaming_non_streaming_mode(self, prompt_processor_streaming,
                                                         mock_flow_context_streaming):
        """Test prompt service in non-streaming mode"""
        # Arrange
        request = PromptRequest(
            id="kg_prompt",
            terms={"question": '"What is AI?"'},
            streaming=False  # Non-streaming
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": "test-456"}
        consumer = MagicMock()
        # Mock non-streaming text completion
        text_completion_client = mock_flow_context_streaming("text-completion-request")
        async def non_streaming_text_completion(system, prompt, streaming=False):
            return "AI is the simulation of human intelligence in machines."
        text_completion_client.text_completion = non_streaming_text_completion
        # Act
        await prompt_processor_streaming.on_request(
            message, consumer, mock_flow_context_streaming
        )
        # Assert
        # Verify response producer was called once (non-streaming)
        response_producer = mock_flow_context_streaming("response")
        # Note: In non-streaming mode, the service sends a single response
        assert response_producer.send.call_count >= 1
    @pytest.mark.asyncio
    async def test_prompt_streaming_chunk_forwarding(self, prompt_processor_streaming,
                                                       mock_flow_context_streaming):
        """Test that prompt service forwards chunks immediately"""
        # Arrange
        request = PromptRequest(
            id="test_prompt",
            terms={"question": '"Test query"'},
            streaming=True
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": "test-789"}
        consumer = MagicMock()
        # Act
        await prompt_processor_streaming.on_request(
            message, consumer, mock_flow_context_streaming
        )
        # Assert
        # Verify chunks were forwarded with proper structure
        response_producer = mock_flow_context_streaming("response")
        calls = response_producer.send.call_args_list
        for call in calls:
            response = call.args[0]
            # Each response should have text and end_of_stream fields
            assert hasattr(response, 'text')
            assert hasattr(response, 'end_of_stream')
    @pytest.mark.asyncio
    async def test_prompt_streaming_error_handling(self, prompt_processor_streaming):
        """Test error handling during streaming"""
        # Arrange
        from trustgraph.schema import Error
        context = MagicMock()
        # Mock text completion client that raises an error
        text_completion_client = AsyncMock()
        async def failing_request(request, recipient=None, timeout=600):
            if recipient:
                # Send error response with proper Error schema
                error_response = TextCompletionResponse(
                    response="",
                    error=Error(message="Text completion error", type="processing_error"),
                    end_of_stream=True
                )
                await recipient(error_response)
        text_completion_client.request = failing_request
        # Mock response producer to capture error response
        response_producer = AsyncMock()
        def context_router(service_name):
            if service_name == "text-completion-request":
                return text_completion_client
            elif service_name == "response":
                return response_producer
            else:
                return AsyncMock()
        context.side_effect = context_router
        request = PromptRequest(
            id="test_prompt",
            terms={"question": '"Test"'},
            streaming=True
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": "test-error"}
        consumer = MagicMock()
        # Act - The service catches errors and sends error responses, doesn't raise
        await prompt_processor_streaming.on_request(message, consumer, context)
        # Assert - Verify error response was sent
        assert response_producer.send.call_count > 0
        # Check that at least one response contains an error
        error_sent = False
        for call in response_producer.send.call_args_list:
            response = call.args[0]
            if hasattr(response, 'error') and response.error:
                error_sent = True
                assert "Text completion error" in response.error.message
                break
        assert error_sent, "Expected error response to be sent"
    @pytest.mark.asyncio
    async def test_prompt_streaming_preserves_message_id(self, prompt_processor_streaming,
                                                           mock_flow_context_streaming):
        """Test that message IDs are preserved through streaming"""
        # Arrange
        message_id = "unique-test-id-12345"
        request = PromptRequest(
            id="test_prompt",
            terms={"question": '"Test"'},
            streaming=True
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": message_id}
        consumer = MagicMock()
        # Act
        await prompt_processor_streaming.on_request(
            message, consumer, mock_flow_context_streaming
        )
        # Assert
        # Verify all responses were sent with the correct message ID
        response_producer = mock_flow_context_streaming("response")
        calls = response_producer.send.call_args_list
        for call in calls:
            properties = call.kwargs.get('properties')
            assert properties is not None
            assert properties['id'] == message_id
    @pytest.mark.asyncio
    async def test_prompt_streaming_empty_response_handling(self, prompt_processor_streaming):
        """Test handling of empty responses during streaming"""
        # Arrange
        context = MagicMock()
        # Mock text completion that sends empty chunks
        text_completion_client = AsyncMock()
        async def empty_streaming_request(request, recipient=None, timeout=600):
            if request.streaming and recipient:
                # Send empty chunk followed by final marker
                await recipient(TextCompletionResponse(
                    response="",
                    error=None,
                    end_of_stream=False
                ))
                await recipient(TextCompletionResponse(
                    response="",
                    error=None,
                    end_of_stream=True
                ))
        text_completion_client.request = empty_streaming_request
        response_producer = AsyncMock()
        def context_router(service_name):
            if service_name == "text-completion-request":
                return text_completion_client
            elif service_name == "response":
                return response_producer
            else:
                return AsyncMock()
        context.side_effect = context_router
        request = PromptRequest(
            id="test_prompt",
            terms={"question": '"Test"'},
            streaming=True
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": "test-empty"}
        consumer = MagicMock()
        # Act
        await prompt_processor_streaming.on_request(message, consumer, context)
        # Assert
        # Should still send responses even if empty (including final marker)
        assert response_producer.send.call_count > 0
        # Last response should have end_of_stream=True
        last_call = response_producer.send.call_args_list[-1]
        last_response = last_call.args[0]
        assert last_response.end_of_stream is True
    @pytest.mark.asyncio
    async def test_prompt_streaming_concatenation_matches_complete(self, prompt_processor_streaming,
                                                                     mock_flow_context_streaming):
        """Test that streaming chunks concatenate to form complete response"""
        # Arrange
        request = PromptRequest(
            id="test_prompt",
            terms={"question": '"What is ML?"'},
            streaming=True
        )
        message = MagicMock()
        message.value.return_value = request
        message.properties.return_value = {"id": "test-concat"}
        consumer = MagicMock()
        # Act
        await prompt_processor_streaming.on_request(
            message, consumer, mock_flow_context_streaming
        )
        # Assert
        # Collect all response texts
        response_producer = mock_flow_context_streaming("response")
        calls = response_producer.send.call_args_list
        chunk_texts = []
        for call in calls:
            response = call.args[0]
            if response.text and not response.end_of_stream:
                chunk_texts.append(response.text)
        # Verify chunks concatenate to expected result
        full_text = "".join(chunk_texts)
        assert full_text == "Machine learning is a field of artificial intelligence"
--- a/tests/integration/test_text_completion_integration.py
+++ b/tests/integration/test_text_completion_integration.py
@ -282,10 +282,11 @@ class TestTextCompletionIntegration:
        # Assert
        # Verify OpenAI API call parameters
        call_args = mock_openai_client.chat.completions.create.call_args
-        assert call_args.kwargs['response_format'] == {"type": "text"}
+        # Note: response_format, top_p, frequency_penalty, and presence_penalty
-        assert call_args.kwargs['top_p'] == 1
+        # were removed in #561 as unnecessary parameters
-        assert call_args.kwargs['frequency_penalty'] == 0
+        assert 'model' in call_args.kwargs
-        assert call_args.kwargs['presence_penalty'] == 0
+        assert 'temperature' in call_args.kwargs
        assert 'max_tokens' in call_args.kwargs
        # Verify result structure
        assert hasattr(result, 'text')
@ -362,9 +363,8 @@ class TestTextCompletionIntegration:
        assert call_args.kwargs['model'] == "gpt-4"
        assert call_args.kwargs['temperature'] == 0.8
        assert call_args.kwargs['max_tokens'] == 2048
-        assert call_args.kwargs['top_p'] == 1
+        # Note: top_p, frequency_penalty, and presence_penalty
-        assert call_args.kwargs['frequency_penalty'] == 0
+        # were removed in #561 as unnecessary parameters
        assert call_args.kwargs['presence_penalty'] == 0
    @pytest.mark.asyncio
    @pytest.mark.slow
--- a/tests/integration/test_text_completion_streaming_integration.py
+++ b/tests/integration/test_text_completion_streaming_integration.py
@ -0,0 +1,366 @@
 """
 Integration tests for Text Completion Streaming Functionality
 These tests verify the streaming behavior of the Text Completion service,
 testing token-by-token response delivery through the complete pipeline.
 """
 import pytest
 from unittest.mock import AsyncMock, MagicMock
 from openai.types.chat import ChatCompletionChunk
 from openai.types.chat.chat_completion_chunk import Choice as StreamChoice, ChoiceDelta
 from trustgraph.model.text_completion.openai.llm import Processor
 from trustgraph.base import LlmChunk
 from tests.utils.streaming_assertions import (
    assert_streaming_chunks_valid,
    assert_callback_invoked,
 )
@pytest.mark.integration
 class TestTextCompletionStreaming:
    """Integration tests for Text Completion streaming"""
    @pytest.fixture
    def mock_streaming_openai_client(self, mock_streaming_llm_response):
        """Mock OpenAI client with streaming support"""
        client = MagicMock()
        def create_streaming_completion(**kwargs):
            """Generator that yields streaming chunks"""
            # Check if streaming is enabled
            if not kwargs.get('stream', False):
                raise ValueError("Expected streaming mode")
            # Simulate OpenAI streaming response
            chunks_text = [
                "Machine", " learning", " is", " a", " subset",
                " of", " AI", " that", " enables", " computers",
                " to", " learn", " from", " data", "."
            ]
            for text in chunks_text:
                delta = ChoiceDelta(content=text, role=None)
                choice = StreamChoice(index=0, delta=delta, finish_reason=None)
                chunk = ChatCompletionChunk(
                    id="chatcmpl-streaming",
                    choices=[choice],
                    created=1234567890,
                    model="gpt-3.5-turbo",
                    object="chat.completion.chunk"
                )
                yield chunk
        # Return a new generator each time create is called
        client.chat.completions.create.side_effect = lambda **kwargs: create_streaming_completion(**kwargs)
        return client
    @pytest.fixture
    def text_completion_processor_streaming(self, mock_streaming_openai_client):
        """Create text completion processor with streaming support"""
        processor = MagicMock()
        processor.default_model = "gpt-3.5-turbo"
        processor.temperature = 0.7
        processor.max_output = 1024
        processor.openai = mock_streaming_openai_client
        # Bind the actual streaming method
        processor.generate_content_stream = Processor.generate_content_stream.__get__(
            processor, Processor
        )
        return processor
    @pytest.mark.asyncio
    async def test_text_completion_streaming_basic(self, text_completion_processor_streaming,
                                                     streaming_chunk_collector):
        """Test basic text completion streaming functionality"""
        # Arrange
        system_prompt = "You are a helpful assistant."
        user_prompt = "What is machine learning?"
        collector = streaming_chunk_collector()
        # Act - Collect all chunks
        chunks = []
        async for chunk in text_completion_processor_streaming.generate_content_stream(
            system_prompt, user_prompt
        ):
            chunks.append(chunk)
            if chunk.text:
                await collector.collect(chunk.text)
        # Assert
        assert len(chunks) > 1  # Should have multiple chunks
        # Verify all chunks are LlmChunk objects
        for chunk in chunks:
            assert isinstance(chunk, LlmChunk)
            assert chunk.model == "gpt-3.5-turbo"
        # Verify last chunk has is_final=True
        assert chunks[-1].is_final is True
        # Verify we got meaningful content
        full_text = collector.get_full_text()
        assert "machine" in full_text.lower() or "learning" in full_text.lower()
    @pytest.mark.asyncio
    async def test_text_completion_streaming_chunk_structure(self, text_completion_processor_streaming):
        """Test that streaming chunks have correct structure"""
        # Arrange
        system_prompt = "You are a helpful assistant."
        user_prompt = "Explain AI."
        # Act
        chunks = []
        async for chunk in text_completion_processor_streaming.generate_content_stream(
            system_prompt, user_prompt
        ):
            chunks.append(chunk)
        # Assert - Verify chunk structure
        for i, chunk in enumerate(chunks[:-1]):  # All except last
            assert isinstance(chunk, LlmChunk)
            assert chunk.text is not None
            assert chunk.model == "gpt-3.5-turbo"
            assert chunk.is_final is False
        # Last chunk should be final marker
        final_chunk = chunks[-1]
        assert final_chunk.is_final is True
        assert final_chunk.model == "gpt-3.5-turbo"
    @pytest.mark.asyncio
    async def test_text_completion_streaming_concatenation(self, text_completion_processor_streaming):
        """Test that chunks concatenate to form complete response"""
        # Arrange
        system_prompt = "You are a helpful assistant."
        user_prompt = "What is AI?"
        # Act - Collect all chunk texts
        chunk_texts = []
        async for chunk in text_completion_processor_streaming.generate_content_stream(
            system_prompt, user_prompt
        ):
            if chunk.text and not chunk.is_final:
                chunk_texts.append(chunk.text)
        # Assert
        full_text = "".join(chunk_texts)
        assert len(full_text) > 0
        assert len(chunk_texts) > 1  # Should have multiple chunks
        # Verify completeness - should be a coherent sentence
        assert full_text == "Machine learning is a subset of AI that enables computers to learn from data."
    @pytest.mark.asyncio
    async def test_text_completion_streaming_final_marker(self, text_completion_processor_streaming):
        """Test that final chunk properly marks end of stream"""
        # Arrange
        system_prompt = "You are a helpful assistant."
        user_prompt = "Test query"
        # Act
        chunks = []
        async for chunk in text_completion_processor_streaming.generate_content_stream(
            system_prompt, user_prompt
        ):
            chunks.append(chunk)
        # Assert
        # Should have at least content chunks + final marker
        assert len(chunks) >= 2
        # Only the last chunk should have is_final=True
        for chunk in chunks[:-1]:
            assert chunk.is_final is False
        assert chunks[-1].is_final is True
    @pytest.mark.asyncio
    async def test_text_completion_streaming_model_parameter(self, mock_streaming_openai_client):
        """Test that model parameter is preserved in streaming"""
        # Arrange
        processor = MagicMock()
        processor.default_model = "gpt-4"
        processor.temperature = 0.5
        processor.max_output = 2048
        processor.openai = mock_streaming_openai_client
        processor.generate_content_stream = Processor.generate_content_stream.__get__(
            processor, Processor
        )
        # Act
        chunks = []
        async for chunk in processor.generate_content_stream("System", "Prompt"):
            chunks.append(chunk)
        # Assert
        # Verify OpenAI was called with correct model
        call_args = mock_streaming_openai_client.chat.completions.create.call_args
        assert call_args.kwargs['model'] == "gpt-4"
        assert call_args.kwargs['temperature'] == 0.5
        assert call_args.kwargs['max_tokens'] == 2048
        assert call_args.kwargs['stream'] is True
        # Verify chunks have correct model
        for chunk in chunks:
            assert chunk.model == "gpt-4"
    @pytest.mark.asyncio
    async def test_text_completion_streaming_temperature_parameter(self, mock_streaming_openai_client):
        """Test that temperature parameter is applied in streaming"""
        # Arrange
        temperatures = [0.0, 0.5, 1.0, 1.5]
        for temp in temperatures:
            processor = MagicMock()
            processor.default_model = "gpt-3.5-turbo"
            processor.temperature = temp
            processor.max_output = 1024
            processor.openai = mock_streaming_openai_client
            processor.generate_content_stream = Processor.generate_content_stream.__get__(
                processor, Processor
            )
            # Act
            chunks = []
            async for chunk in processor.generate_content_stream("System", "Prompt"):
                chunks.append(chunk)
                if chunk.is_final:
                    break
            # Assert
            call_args = mock_streaming_openai_client.chat.completions.create.call_args
            assert call_args.kwargs['temperature'] == temp
            # Reset mock for next iteration
            mock_streaming_openai_client.reset_mock()
    @pytest.mark.asyncio
    async def test_text_completion_streaming_error_propagation(self):
        """Test that errors during streaming are properly propagated"""
        # Arrange
        mock_client = MagicMock()
        def failing_stream(**kwargs):
            yield from []
            raise Exception("Streaming error")
        mock_client.chat.completions.create.return_value = failing_stream()
        processor = MagicMock()
        processor.default_model = "gpt-3.5-turbo"
        processor.temperature = 0.7
        processor.max_output = 1024
        processor.openai = mock_client
        processor.generate_content_stream = Processor.generate_content_stream.__get__(
            processor, Processor
        )
        # Act & Assert
        with pytest.raises(Exception) as exc_info:
            async for chunk in processor.generate_content_stream("System", "Prompt"):
                pass
        assert "Streaming error" in str(exc_info.value)
    @pytest.mark.asyncio
    async def test_text_completion_streaming_empty_chunks_filtered(self, mock_streaming_openai_client):
        """Test that empty chunks are handled correctly"""
        # Arrange - Mock that returns some empty chunks
        def create_streaming_with_empties(**kwargs):
            chunks_text = ["Hello", "", " world", "", "!"]
            for text in chunks_text:
                delta = ChoiceDelta(content=text if text else None, role=None)
                choice = StreamChoice(index=0, delta=delta, finish_reason=None)
                chunk = ChatCompletionChunk(
                    id="chatcmpl-streaming",
                    choices=[choice],
                    created=1234567890,
                    model="gpt-3.5-turbo",
                    object="chat.completion.chunk"
                )
                yield chunk
        mock_streaming_openai_client.chat.completions.create.side_effect = lambda **kwargs: create_streaming_with_empties(**kwargs)
        processor = MagicMock()
        processor.default_model = "gpt-3.5-turbo"
        processor.temperature = 0.7
        processor.max_output = 1024
        processor.openai = mock_streaming_openai_client
        processor.generate_content_stream = Processor.generate_content_stream.__get__(
            processor, Processor
        )
        # Act
        chunks = []
        async for chunk in processor.generate_content_stream("System", "Prompt"):
            chunks.append(chunk)
        # Assert - Only non-empty chunks should be yielded (plus final marker)
        text_chunks = [c for c in chunks if not c.is_final]
        assert len(text_chunks) == 3  # "Hello", " world", "!"
        assert "".join(c.text for c in text_chunks) == "Hello world!"
    @pytest.mark.asyncio
    async def test_text_completion_streaming_prompt_construction(self, mock_streaming_openai_client):
        """Test that system and user prompts are correctly combined for streaming"""
        # Arrange
        processor = MagicMock()
        processor.default_model = "gpt-3.5-turbo"
        processor.temperature = 0.7
        processor.max_output = 1024
        processor.openai = mock_streaming_openai_client
        processor.generate_content_stream = Processor.generate_content_stream.__get__(
            processor, Processor
        )
        system_prompt = "You are an expert."
        user_prompt = "Explain quantum physics."
        # Act
        chunks = []
        async for chunk in processor.generate_content_stream(system_prompt, user_prompt):
            chunks.append(chunk)
            if chunk.is_final:
                break
        # Assert - Verify prompts were combined correctly
        call_args = mock_streaming_openai_client.chat.completions.create.call_args
        messages = call_args.kwargs['messages']
        assert len(messages) == 1
        message_content = messages[0]['content'][0]['text']
        assert system_prompt in message_content
        assert user_prompt in message_content
        assert message_content.startswith(system_prompt)
    @pytest.mark.asyncio
    async def test_text_completion_streaming_chunk_count(self, text_completion_processor_streaming):
        """Test that streaming produces expected number of chunks"""
        # Arrange
        system_prompt = "You are a helpful assistant."
        user_prompt = "Test"
        # Act
        chunks = []
        async for chunk in text_completion_processor_streaming.generate_content_stream(
            system_prompt, user_prompt
        ):
            chunks.append(chunk)
        # Assert
        # Should have 15 content chunks + 1 final marker = 16 total
        assert len(chunks) == 16
        # 15 content chunks
        content_chunks = [c for c in chunks if not c.is_final]
        assert len(content_chunks) == 15
        # 1 final marker
        final_chunks = [c for c in chunks if c.is_final]
        assert len(final_chunks) == 1
--- a/tests/unit/test_query/test_doc_embeddings_pinecone_query.py
+++ b/tests/unit/test_query/test_doc_embeddings_pinecone_query.py
@ -5,6 +5,9 @@ Tests for Pinecone document embeddings query service
 import pytest
 from unittest.mock import MagicMock, patch
 # Skip all tests in this module due to missing Pinecone dependency
 pytest.skip("Pinecone library missing protoc_gen_openapiv2 dependency", allow_module_level=True)
 from trustgraph.query.doc_embeddings.pinecone.service import Processor
--- a/tests/unit/test_query/test_graph_embeddings_pinecone_query.py
+++ b/tests/unit/test_query/test_graph_embeddings_pinecone_query.py
@ -5,6 +5,9 @@ Tests for Pinecone graph embeddings query service
 import pytest
 from unittest.mock import MagicMock, patch
 # Skip all tests in this module due to missing Pinecone dependency
 pytest.skip("Pinecone library missing protoc_gen_openapiv2 dependency", allow_module_level=True)
 from trustgraph.query.graph_embeddings.pinecone.service import Processor
 from trustgraph.schema import Value
--- a/tests/unit/test_storage/test_doc_embeddings_pinecone_storage.py
+++ b/tests/unit/test_storage/test_doc_embeddings_pinecone_storage.py
@ -6,6 +6,9 @@ import pytest
 from unittest.mock import MagicMock, patch
 import uuid
 # Skip all tests in this module due to missing Pinecone dependency
 pytest.skip("Pinecone library missing protoc_gen_openapiv2 dependency", allow_module_level=True)
 from trustgraph.storage.doc_embeddings.pinecone.write import Processor
 from trustgraph.schema import ChunkEmbeddings
--- a/tests/unit/test_storage/test_graph_embeddings_pinecone_storage.py
+++ b/tests/unit/test_storage/test_graph_embeddings_pinecone_storage.py
@ -6,6 +6,9 @@ import pytest
 from unittest.mock import MagicMock, patch
 import uuid
 # Skip all tests in this module due to missing Pinecone dependency
 pytest.skip("Pinecone library missing protoc_gen_openapiv2 dependency", allow_module_level=True)
 from trustgraph.storage.graph_embeddings.pinecone.write import Processor
 from trustgraph.schema import EntityEmbeddings, Value
--- a/tests/utils/init.py
+++ b/tests/utils/init.py
@ -0,0 +1,29 @@
 """Test utilities for TrustGraph tests"""
 from .streaming_assertions import (
    assert_streaming_chunks_valid,
    assert_streaming_sequence,
    assert_agent_streaming_chunks,
    assert_rag_streaming_chunks,
    assert_streaming_completion,
    assert_streaming_content_matches,
    assert_no_empty_chunks,
    assert_streaming_error_handled,
    assert_chunk_types_valid,
    assert_streaming_latency_acceptable,
    assert_callback_invoked,
 )
 __all__ = [
    "assert_streaming_chunks_valid",
    "assert_streaming_sequence",
    "assert_agent_streaming_chunks",
    "assert_rag_streaming_chunks",
    "assert_streaming_completion",
    "assert_streaming_content_matches",
    "assert_no_empty_chunks",
    "assert_streaming_error_handled",
    "assert_chunk_types_valid",
    "assert_streaming_latency_acceptable",
    "assert_callback_invoked",
 ]
--- a/tests/utils/streaming_assertions.py
+++ b/tests/utils/streaming_assertions.py
@ -0,0 +1,218 @@
 """
 Streaming test assertion helpers
 Provides reusable assertion functions for validating streaming behavior
 across different TrustGraph services.
 """
 from typing import List, Dict, Any, Optional
 def assert_streaming_chunks_valid(chunks: List[Any], min_chunks: int = 1):
    """
    Assert that streaming chunks are valid and non-empty.
    Args:
        chunks: List of streaming chunks
        min_chunks: Minimum number of expected chunks
    """
    assert len(chunks) >= min_chunks, f"Expected at least {min_chunks} chunks, got {len(chunks)}"
    assert all(chunk is not None for chunk in chunks), "All chunks should be non-None"
 def assert_streaming_sequence(chunks: List[Dict[str, Any]], expected_sequence: List[str], key: str = "chunk_type"):
    """
    Assert that streaming chunks follow an expected sequence.
    Args:
        chunks: List of chunk dictionaries
        expected_sequence: Expected sequence of chunk types/values
        key: Dictionary key to check (default: "chunk_type")
    """
    actual_sequence = [chunk.get(key) for chunk in chunks if key in chunk]
    assert actual_sequence == expected_sequence, \
        f"Expected sequence {expected_sequence}, got {actual_sequence}"
 def assert_agent_streaming_chunks(chunks: List[Dict[str, Any]]):
    """
    Assert that agent streaming chunks have valid structure.
    Validates:
    - All chunks have chunk_type field
    - All chunks have content field
    - All chunks have end_of_message field
    - All chunks have end_of_dialog field
    - Last chunk has end_of_dialog=True
    Args:
        chunks: List of agent streaming chunk dictionaries
    """
    assert len(chunks) > 0, "Expected at least one chunk"
    for i, chunk in enumerate(chunks):
        assert "chunk_type" in chunk, f"Chunk {i} missing chunk_type"
        assert "content" in chunk, f"Chunk {i} missing content"
        assert "end_of_message" in chunk, f"Chunk {i} missing end_of_message"
        assert "end_of_dialog" in chunk, f"Chunk {i} missing end_of_dialog"
        # Validate chunk_type values
        valid_types = ["thought", "action", "observation", "final-answer"]
        assert chunk["chunk_type"] in valid_types, \
            f"Invalid chunk_type '{chunk['chunk_type']}' at index {i}"
    # Last chunk should signal end of dialog
    assert chunks[-1]["end_of_dialog"] is True, \
        "Last chunk should have end_of_dialog=True"
 def assert_rag_streaming_chunks(chunks: List[Dict[str, Any]]):
    """
    Assert that RAG streaming chunks have valid structure.
    Validates:
    - All chunks except last have chunk field
    - All chunks have end_of_stream field
    - Last chunk has end_of_stream=True
    - Last chunk may have response field with complete text
    Args:
        chunks: List of RAG streaming chunk dictionaries
    """
    assert len(chunks) > 0, "Expected at least one chunk"
    for i, chunk in enumerate(chunks):
        assert "end_of_stream" in chunk, f"Chunk {i} missing end_of_stream"
        if i < len(chunks) - 1:
            # Non-final chunks should have chunk content and end_of_stream=False
            assert "chunk" in chunk, f"Chunk {i} missing chunk field"
            assert chunk["end_of_stream"] is False, \
                f"Non-final chunk {i} should have end_of_stream=False"
        else:
            # Final chunk should have end_of_stream=True
            assert chunk["end_of_stream"] is True, \
                "Last chunk should have end_of_stream=True"
 def assert_streaming_completion(chunks: List[Dict[str, Any]], expected_complete_flag: str = "end_of_stream"):
    """
    Assert that streaming completed properly.
    Args:
        chunks: List of streaming chunk dictionaries
        expected_complete_flag: Name of the completion flag field
    """
    assert len(chunks) > 0, "Expected at least one chunk"
    # Check that all but last chunk have completion flag = False
    for i, chunk in enumerate(chunks[:-1]):
        assert chunk.get(expected_complete_flag) is False, \
            f"Non-final chunk {i} should have {expected_complete_flag}=False"
    # Check that last chunk has completion flag = True
    assert chunks[-1].get(expected_complete_flag) is True, \
        f"Final chunk should have {expected_complete_flag}=True"
 def assert_streaming_content_matches(chunks: List, expected_content: str, content_key: str = "chunk"):
    """
    Assert that concatenated streaming chunks match expected content.
    Args:
        chunks: List of streaming chunks (strings or dicts)
        expected_content: Expected complete content after concatenation
        content_key: Dictionary key for content (used if chunks are dicts)
    """
    if isinstance(chunks[0], dict):
        # Extract content from chunk dictionaries
        content_chunks = [
            chunk.get(content_key, "")
            for chunk in chunks
            if chunk.get(content_key) is not None
        ]
        actual_content = "".join(content_chunks)
    else:
        # Chunks are already strings
        actual_content = "".join(chunks)
    assert actual_content == expected_content, \
        f"Expected content '{expected_content}', got '{actual_content}'"
 def assert_no_empty_chunks(chunks: List[Dict[str, Any]], content_key: str = "content"):
    """
    Assert that no chunks have empty content (except final chunk if it's completion marker).
    Args:
        chunks: List of streaming chunk dictionaries
        content_key: Dictionary key for content
    """
    for i, chunk in enumerate(chunks[:-1]):
        content = chunk.get(content_key)
        assert content is not None and len(content) > 0, \
            f"Chunk {i} has empty content"
 def assert_streaming_error_handled(chunks: List[Dict[str, Any]], error_flag: str = "error"):
    """
    Assert that streaming error was properly signaled.
    Args:
        chunks: List of streaming chunk dictionaries
        error_flag: Name of the error flag field
    """
    # Check that at least one chunk has error flag
    has_error = any(chunk.get(error_flag) is not None for chunk in chunks)
    assert has_error, "Expected error flag in at least one chunk"
    # If last chunk has error, should also have completion flag
    if chunks[-1].get(error_flag):
        # Check for completion flags (either end_of_stream or end_of_dialog)
        completion_flags = ["end_of_stream", "end_of_dialog"]
        has_completion = any(chunks[-1].get(flag) is True for flag in completion_flags)
        assert has_completion, \
            "Error chunk should have completion flag set to True"
 def assert_chunk_types_valid(chunks: List[Dict[str, Any]], valid_types: List[str], type_key: str = "chunk_type"):
    """
    Assert that all chunk types are from a valid set.
    Args:
        chunks: List of streaming chunk dictionaries
        valid_types: List of valid chunk type values
        type_key: Dictionary key for chunk type
    """
    for i, chunk in enumerate(chunks):
        chunk_type = chunk.get(type_key)
        assert chunk_type in valid_types, \
            f"Chunk {i} has invalid type '{chunk_type}', expected one of {valid_types}"
 def assert_streaming_latency_acceptable(chunk_timestamps: List[float], max_gap_seconds: float = 5.0):
    """
    Assert that streaming latency between chunks is acceptable.
    Args:
        chunk_timestamps: List of timestamps when chunks were received
        max_gap_seconds: Maximum acceptable gap between chunks in seconds
    """
    assert len(chunk_timestamps) > 1, "Need at least 2 timestamps to check latency"
    for i in range(1, len(chunk_timestamps)):
        gap = chunk_timestamps[i] - chunk_timestamps[i-1]
        assert gap <= max_gap_seconds, \
            f"Gap between chunks {i-1} and {i} is {gap:.2f}s, exceeds max {max_gap_seconds}s"
 def assert_callback_invoked(mock_callback, min_calls: int = 1):
    """
    Assert that a streaming callback was invoked minimum number of times.
    Args:
        mock_callback: AsyncMock callback object
        min_calls: Minimum number of expected calls
    """
    assert mock_callback.call_count >= min_calls, \
        f"Expected callback to be called at least {min_calls} times, was called {mock_callback.call_count} times"
--- a/trustgraph-base/trustgraph/base/init.py
+++ b/trustgraph-base/trustgraph/base/init.py
@ -12,7 +12,7 @@ from . parameter_spec import ParameterSpec
 from . producer_spec import ProducerSpec
 from . subscriber_spec import SubscriberSpec
 from . request_response_spec import RequestResponseSpec
-from . llm_service import LlmService, LlmResult
+from . llm_service import LlmService, LlmResult, LlmChunk
 from . chunking_service import ChunkingService
 from . embeddings_service import EmbeddingsService
 from . embeddings_client import EmbeddingsClientSpec
--- a/trustgraph-base/trustgraph/base/llm_service.py
+++ b/trustgraph-base/trustgraph/base/llm_service.py
@ -28,6 +28,19 @@ class LlmResult:
        self.model = model
    __slots__ = ["text", "in_token", "out_token", "model"]
 class LlmChunk:
    """Represents a streaming chunk from an LLM"""
    def __init__(
            self, text = None, in_token = None, out_token = None,
            model = None, is_final = False,
    ):
        self.text = text
        self.in_token = in_token
        self.out_token = out_token
        self.model = model
        self.is_final = is_final
    __slots__ = ["text", "in_token", "out_token", "model", "is_final"]
 class LlmService(FlowProcessor):
    def __init__(self, **params):
@ -99,16 +112,57 @@ class LlmService(FlowProcessor):
            id = msg.properties()["id"]
-            with __class__.text_completion_metric.labels(
+            model = flow("model")
-                    id=self.id,
+            temperature = flow("temperature")
                    flow=f"{flow.name}-{consumer.name}",
            ).time():
-                model = flow("model")
+            # Check if streaming is requested and supported
-                temperature = flow("temperature")
+            streaming = getattr(request, 'streaming', False)
-                response = await self.generate_content(
+            if streaming and self.supports_streaming():
-                    request.system, request.prompt, model, temperature
+
                # Streaming mode
                with __class__.text_completion_metric.labels(
                        id=self.id,
                        flow=f"{flow.name}-{consumer.name}",
                ).time():
                    async for chunk in self.generate_content_stream(
                        request.system, request.prompt, model, temperature
                    ):
                        await flow("response").send(
                            TextCompletionResponse(
                                error=None,
                                response=chunk.text,
                                in_token=chunk.in_token,
                                out_token=chunk.out_token,
                                model=chunk.model,
                                end_of_stream=chunk.is_final
                            ),
                            properties={"id": id}
                        )
            else:
                # Non-streaming mode (original behavior)
                with __class__.text_completion_metric.labels(
                        id=self.id,
                        flow=f"{flow.name}-{consumer.name}",
                ).time():
                    response = await self.generate_content(
                        request.system, request.prompt, model, temperature
                    )
                await flow("response").send(
                    TextCompletionResponse(
                        error=None,
                        response=response.text,
                        in_token=response.in_token,
                        out_token=response.out_token,
                        model=response.model,
                        end_of_stream=True
                    ),
                    properties={"id": id}
                )
            __class__.text_completion_model_metric.labels(
@ -119,17 +173,6 @@ class LlmService(FlowProcessor):
                "temperature": str(temperature) if temperature is not None else "",
            })
            await flow("response").send(
                TextCompletionResponse(
                    error=None,
                    response=response.text,
                    in_token=response.in_token,
                    out_token=response.out_token,
                    model=response.model
                ),
                properties={"id": id}
            )
        except TooManyRequests as e:
            raise e
@ -151,10 +194,26 @@ class LlmService(FlowProcessor):
                    in_token=None,
                    out_token=None,
                    model=None,
                    end_of_stream=True
                ),
                properties={"id": id}
            )
    def supports_streaming(self):
        """
        Override in subclass to indicate streaming support.
        Returns False by default.
        """
        return False
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """
        Override in subclass to implement streaming.
        Should yield LlmChunk objects.
        The final chunk should have is_final=True.
        """
        raise NotImplementedError("Streaming not implemented for this provider")
    @staticmethod
    def add_args(parser):
--- a/trustgraph-base/trustgraph/base/prompt_client.py
+++ b/trustgraph-base/trustgraph/base/prompt_client.py
@ -1,30 +1,95 @@
 import json
 import asyncio
 import logging
 from . request_response_spec import RequestResponse, RequestResponseSpec
 from .. schema import PromptRequest, PromptResponse
 logger = logging.getLogger(__name__)
 class PromptClient(RequestResponse):
-    async def prompt(self, id, variables, timeout=600):
+    async def prompt(self, id, variables, timeout=600, streaming=False, chunk_callback=None):
        logger.info(f"DEBUG prompt_client: prompt called, id={id}, streaming={streaming}, chunk_callback={chunk_callback is not None}")
-        resp = await self.request(
+        if not streaming:
-            PromptRequest(
+            logger.info("DEBUG prompt_client: Non-streaming path")
            # Non-streaming path
            resp = await self.request(
                PromptRequest(
                    id = id,
                    terms = {
                        k: json.dumps(v)
                        for k, v in variables.items()
                    },
                    streaming = False
                ),
                timeout=timeout
            )
            if resp.error:
                raise RuntimeError(resp.error.message)
            if resp.text: return resp.text
            return json.loads(resp.object)
        else:
            logger.info("DEBUG prompt_client: Streaming path")
            # Streaming path - collect all chunks
            full_text = ""
            full_object = None
            async def collect_chunks(resp):
                nonlocal full_text, full_object
                logger.info(f"DEBUG prompt_client: collect_chunks called, resp.text={resp.text[:50] if resp.text else None}, end_of_stream={getattr(resp, 'end_of_stream', False)}")
                if resp.error:
                    logger.error(f"DEBUG prompt_client: Error in response: {resp.error.message}")
                    raise RuntimeError(resp.error.message)
                if resp.text:
                    full_text += resp.text
                    logger.info(f"DEBUG prompt_client: Accumulated {len(full_text)} chars")
                    # Call chunk callback if provided
                    if chunk_callback:
                        logger.info(f"DEBUG prompt_client: Calling chunk_callback")
                        if asyncio.iscoroutinefunction(chunk_callback):
                            await chunk_callback(resp.text)
                        else:
                            chunk_callback(resp.text)
                elif resp.object:
                    logger.info(f"DEBUG prompt_client: Got object response")
                    full_object = resp.object
                end_stream = getattr(resp, 'end_of_stream', False)
                logger.info(f"DEBUG prompt_client: Returning end_of_stream={end_stream}")
                return end_stream
            logger.info("DEBUG prompt_client: Creating PromptRequest")
            req = PromptRequest(
                id = id,
                terms = {
                    k: json.dumps(v)
                    for k, v in variables.items()
-                }
+                },
-            ),
+                streaming = True
-            timeout=timeout
+            )
-        )
+            logger.info(f"DEBUG prompt_client: About to call self.request with recipient, timeout={timeout}")
            await self.request(
                req,
                recipient=collect_chunks,
                timeout=timeout
            )
            logger.info(f"DEBUG prompt_client: self.request returned, full_text has {len(full_text)} chars")
-        if resp.error:
+            if full_text:
-            raise RuntimeError(resp.error.message)
+                logger.info("DEBUG prompt_client: Returning full_text")
                return full_text
-        if resp.text: return resp.text
+            logger.info("DEBUG prompt_client: Returning parsed full_object")
-
+            return json.loads(full_object)
        return json.loads(resp.object)
    async def extract_definitions(self, text, timeout=600):
        return await self.prompt(
@ -47,7 +112,7 @@ class PromptClient(RequestResponse):
            timeout = timeout,
        )
-    async def kg_prompt(self, query, kg, timeout=600):
+    async def kg_prompt(self, query, kg, timeout=600, streaming=False, chunk_callback=None):
        return await self.prompt(
            id = "kg-prompt",
            variables = {
@ -58,9 +123,11 @@ class PromptClient(RequestResponse):
                ]
            },
            timeout = timeout,
            streaming = streaming,
            chunk_callback = chunk_callback,
        )
-    async def document_prompt(self, query, documents, timeout=600):
+    async def document_prompt(self, query, documents, timeout=600, streaming=False, chunk_callback=None):
        return await self.prompt(
            id = "document-prompt",
            variables = {
@ -68,13 +135,17 @@ class PromptClient(RequestResponse):
                "documents": documents,
            },
            timeout = timeout,
            streaming = streaming,
            chunk_callback = chunk_callback,
        )
-    async def agent_react(self, variables, timeout=600):
+    async def agent_react(self, variables, timeout=600, streaming=False, chunk_callback=None):
        return await self.prompt(
            id = "agent-react",
            variables = variables,
            timeout = timeout,
            streaming = streaming,
            chunk_callback = chunk_callback,
        )
    async def question(self, question, timeout=600):
--- a/trustgraph-base/trustgraph/base/subscriber.py
+++ b/trustgraph-base/trustgraph/base/subscriber.py
@ -43,12 +43,18 @@ class Subscriber:
    async def start(self):
-        self.consumer = self.client.subscribe(
+        # Build subscribe arguments
-            topic = self.topic,
+        subscribe_args = {
-            subscription_name = self.subscription,
+            'topic': self.topic,
-            consumer_name = self.consumer_name,
+            'subscription_name': self.subscription,
-            schema = JsonSchema(self.schema),
+            'consumer_name': self.consumer_name,
-        )
+        }
        # Only add schema if provided (omit if None)
        if self.schema is not None:
            subscribe_args['schema'] = JsonSchema(self.schema)
        self.consumer = self.client.subscribe(**subscribe_args)
        self.task = asyncio.create_task(self.run())
@ -87,10 +93,14 @@ class Subscriber:
                    if self.draining and drain_end_time is None:
                        drain_end_time = time.time() + self.drain_timeout
                        logger.info(f"Subscriber entering drain mode, timeout={self.drain_timeout}s")
-                        
+
                        # Stop accepting new messages from Pulsar during drain
                        if self.consumer:
-                            self.consumer.pause_message_listener()
+                            try:
                                self.consumer.pause_message_listener()
                            except _pulsar.InvalidConfiguration:
                                # Not all consumers have message listeners (e.g., blocking receive mode)
                                pass
                    # Check drain timeout
                    if self.draining and drain_end_time and time.time() > drain_end_time:
@ -145,12 +155,21 @@ class Subscriber:
            finally:
                # Negative acknowledge any pending messages
                for msg in self.pending_acks.values():
-                    self.consumer.negative_acknowledge(msg)
+                    try:
                        self.consumer.negative_acknowledge(msg)
                    except _pulsar.AlreadyClosed:
                        pass  # Consumer already closed
                self.pending_acks.clear()
                if self.consumer:
-                    self.consumer.unsubscribe()
+                    try:
-                    self.consumer.close()
+                        self.consumer.unsubscribe()
                    except _pulsar.AlreadyClosed:
                        pass  # Already closed
                    try:
                        self.consumer.close()
                    except _pulsar.AlreadyClosed:
                        pass  # Already closed
                    self.consumer = None
--- a/trustgraph-base/trustgraph/base/text_completion_client.py
+++ b/trustgraph-base/trustgraph/base/text_completion_client.py
@ -3,18 +3,45 @@ from . request_response_spec import RequestResponse, RequestResponseSpec
 from .. schema import TextCompletionRequest, TextCompletionResponse
 class TextCompletionClient(RequestResponse):
-    async def text_completion(self, system, prompt, timeout=600):
+    async def text_completion(self, system, prompt, streaming=False, timeout=600):
-        resp = await self.request(
+        # If not streaming, use original behavior
        if not streaming:
            resp = await self.request(
                TextCompletionRequest(
                    system = system, prompt = prompt, streaming = False
                ),
                timeout=timeout
            )
            if resp.error:
                raise RuntimeError(resp.error.message)
            return resp.response
        # For streaming: collect all chunks and return complete response
        full_response = ""
        async def collect_chunks(resp):
            nonlocal full_response
            if resp.error:
                raise RuntimeError(resp.error.message)
            if resp.response:
                full_response += resp.response
            # Return True when end_of_stream is reached
            return getattr(resp, 'end_of_stream', False)
        await self.request(
            TextCompletionRequest(
-                system = system, prompt = prompt
+                system = system, prompt = prompt, streaming = True
            ),
            recipient=collect_chunks,
            timeout=timeout
        )
-        if resp.error:
+        return full_response
            raise RuntimeError(resp.error.message)
        return resp.response
 class TextCompletionClientSpec(RequestResponseSpec):
    def __init__(
--- a/trustgraph-base/trustgraph/clients/llm_client.py
+++ b/trustgraph-base/trustgraph/clients/llm_client.py
@ -5,6 +5,7 @@ from .. schema import TextCompletionRequest, TextCompletionResponse
 from .. schema import text_completion_request_queue
 from .. schema import text_completion_response_queue
 from . base import BaseClient
 from .. exceptions import LlmError
 # Ugly
 ERROR=_pulsar.LoggerLevel.Error
@ -37,8 +38,68 @@ class LlmClient(BaseClient):
            output_schema=TextCompletionResponse,
        )
-    def request(self, system, prompt, timeout=300):
+    def request(self, system, prompt, timeout=300, streaming=False):
        """
        Non-streaming request (backward compatible).
        Returns complete response string.
        """
        if streaming:
            raise ValueError("Use request_stream() for streaming requests")
        return self.call(
-            system=system, prompt=prompt, timeout=timeout
+            system=system, prompt=prompt, streaming=False, timeout=timeout
        ).response
    def request_stream(self, system, prompt, timeout=300):
        """
        Streaming request generator.
        Yields response chunks as they arrive.
        Usage:
            for chunk in client.request_stream(system, prompt):
                print(chunk.response, end='', flush=True)
        """
        import time
        import uuid
        id = str(uuid.uuid4())
        request = TextCompletionRequest(
            system=system, prompt=prompt, streaming=True
        )
        end_time = time.time() + timeout
        self.producer.send(request, properties={"id": id})
        # Collect responses until end_of_stream
        while time.time() < end_time:
            try:
                msg = self.consumer.receive(timeout_millis=2500)
            except Exception:
                continue
            mid = msg.properties()["id"]
            if mid == id:
                value = msg.value()
                # Handle errors
                if value.error:
                    self.consumer.acknowledge(msg)
                    if value.error.type == "llm-error":
                        raise LlmError(value.error.message)
                    else:
                        raise RuntimeError(
                            f"{value.error.type}: {value.error.message}"
                        )
                self.consumer.acknowledge(msg)
                yield value
                # Check if this is the final chunk
                if getattr(value, 'end_of_stream', True):
                    break
            else:
                # Ignore messages with wrong ID
                self.consumer.acknowledge(msg)
        if time.time() >= end_time:
            raise TimeoutError("Timed out waiting for response")
--- a/trustgraph-base/trustgraph/messaging/translators/agent.py
+++ b/trustgraph-base/trustgraph/messaging/translators/agent.py
@ -12,16 +12,18 @@ class AgentRequestTranslator(MessageTranslator):
            state=data.get("state", None),
            group=data.get("group", None),
            history=data.get("history", []),
-            user=data.get("user", "trustgraph")
+            user=data.get("user", "trustgraph"),
            streaming=data.get("streaming", False)
        )
-    
+
    def from_pulsar(self, obj: AgentRequest) -> Dict[str, Any]:
        return {
            "question": obj.question,
            "state": obj.state,
            "group": obj.group,
            "history": obj.history,
-            "user": obj.user
+            "user": obj.user,
            "streaming": getattr(obj, "streaming", False)
        }
@ -33,14 +35,36 @@ class AgentResponseTranslator(MessageTranslator):
    def from_pulsar(self, obj: AgentResponse) -> Dict[str, Any]:
        result = {}
-        if obj.answer:
+
-            result["answer"] = obj.answer
+        # Check if this is a streaming response (has chunk_type)
-        if obj.thought:
+        if hasattr(obj, 'chunk_type') and obj.chunk_type:
-            result["thought"] = obj.thought  
+            result["chunk_type"] = obj.chunk_type
-        if obj.observation:
+            if obj.content:
-            result["observation"] = obj.observation
+                result["content"] = obj.content
            result["end_of_message"] = getattr(obj, "end_of_message", False)
            result["end_of_dialog"] = getattr(obj, "end_of_dialog", False)
        else:
            # Legacy format
            if obj.answer:
                result["answer"] = obj.answer
            if obj.thought:
                result["thought"] = obj.thought
            if obj.observation:
                result["observation"] = obj.observation
        # Always include error if present
        if hasattr(obj, 'error') and obj.error and obj.error.message:
            result["error"] = {"message": obj.error.message, "code": obj.error.code}
        return result
-    
+
    def from_response_with_completion(self, obj: AgentResponse) -> Tuple[Dict[str, Any], bool]:
        """Returns (response_dict, is_final)"""
-        return self.from_pulsar(obj), (obj.answer is not None)
+        # For streaming responses, check end_of_dialog
        if hasattr(obj, 'chunk_type') and obj.chunk_type:
            is_final = getattr(obj, 'end_of_dialog', False)
        else:
            # For legacy responses, check if answer is present
            is_final = (obj.answer is not None)
        return self.from_pulsar(obj), is_final
--- a/trustgraph-base/trustgraph/messaging/translators/prompt.py
+++ b/trustgraph-base/trustgraph/messaging/translators/prompt.py
@ -16,10 +16,11 @@ class PromptRequestTranslator(MessageTranslator):
                k: json.dumps(v)
                for k, v in data["variables"].items()
            }
-        
+
        return PromptRequest(
            id=data.get("id"),
-            terms=terms
+            terms=terms,
            streaming=data.get("streaming", False)
        )
    def from_pulsar(self, obj: PromptRequest) -> Dict[str, Any]:
@ -51,4 +52,6 @@ class PromptResponseTranslator(MessageTranslator):
    def from_response_with_completion(self, obj: PromptResponse) -> Tuple[Dict[str, Any], bool]:
        """Returns (response_dict, is_final)"""
-        return self.from_pulsar(obj), True
+        # Check end_of_stream field to determine if this is the final message
        is_final = getattr(obj, 'end_of_stream', True)
        return self.from_pulsar(obj), is_final
--- a/trustgraph-base/trustgraph/messaging/translators/retrieval.py
+++ b/trustgraph-base/trustgraph/messaging/translators/retrieval.py
@ -5,43 +5,65 @@ from .base import MessageTranslator
 class DocumentRagRequestTranslator(MessageTranslator):
    """Translator for DocumentRagQuery schema objects"""
-    
+
    def to_pulsar(self, data: Dict[str, Any]) -> DocumentRagQuery:
        return DocumentRagQuery(
            query=data["query"],
            user=data.get("user", "trustgraph"),
            collection=data.get("collection", "default"),
-            doc_limit=int(data.get("doc-limit", 20))
+            doc_limit=int(data.get("doc-limit", 20)),
            streaming=data.get("streaming", False)
        )
-    
+
    def from_pulsar(self, obj: DocumentRagQuery) -> Dict[str, Any]:
        return {
            "query": obj.query,
            "user": obj.user,
            "collection": obj.collection,
-            "doc-limit": obj.doc_limit
+            "doc-limit": obj.doc_limit,
            "streaming": getattr(obj, "streaming", False)
        }
 class DocumentRagResponseTranslator(MessageTranslator):
    """Translator for DocumentRagResponse schema objects"""
-    
+
    def to_pulsar(self, data: Dict[str, Any]) -> DocumentRagResponse:
        raise NotImplementedError("Response translation to Pulsar not typically needed")
-    
+
    def from_pulsar(self, obj: DocumentRagResponse) -> Dict[str, Any]:
-        return {
+        result = {}
-            "response": obj.response
+
-        }
+        # Check if this is a streaming response (has chunk)
-    
+        if hasattr(obj, 'chunk') and obj.chunk:
            result["chunk"] = obj.chunk
            result["end_of_stream"] = getattr(obj, "end_of_stream", False)
        else:
            # Non-streaming response
            if obj.response:
                result["response"] = obj.response
        # Always include error if present
        if hasattr(obj, 'error') and obj.error and obj.error.message:
            result["error"] = {"message": obj.error.message, "type": obj.error.type}
        return result
    def from_response_with_completion(self, obj: DocumentRagResponse) -> Tuple[Dict[str, Any], bool]:
        """Returns (response_dict, is_final)"""
-        return self.from_pulsar(obj), True
+        # For streaming responses, check end_of_stream
        if hasattr(obj, 'chunk') and obj.chunk:
            is_final = getattr(obj, 'end_of_stream', False)
        else:
            # For non-streaming responses, it's always final
            is_final = True
        return self.from_pulsar(obj), is_final
 class GraphRagRequestTranslator(MessageTranslator):
    """Translator for GraphRagQuery schema objects"""
-    
+
    def to_pulsar(self, data: Dict[str, Any]) -> GraphRagQuery:
        return GraphRagQuery(
            query=data["query"],
@ -50,9 +72,10 @@ class GraphRagRequestTranslator(MessageTranslator):
            entity_limit=int(data.get("entity-limit", 50)),
            triple_limit=int(data.get("triple-limit", 30)),
            max_subgraph_size=int(data.get("max-subgraph-size", 1000)),
-            max_path_length=int(data.get("max-path-length", 2))
+            max_path_length=int(data.get("max-path-length", 2)),
            streaming=data.get("streaming", False)
        )
-    
+
    def from_pulsar(self, obj: GraphRagQuery) -> Dict[str, Any]:
        return {
            "query": obj.query,
@ -61,21 +84,42 @@ class GraphRagRequestTranslator(MessageTranslator):
            "entity-limit": obj.entity_limit,
            "triple-limit": obj.triple_limit,
            "max-subgraph-size": obj.max_subgraph_size,
-            "max-path-length": obj.max_path_length
+            "max-path-length": obj.max_path_length,
            "streaming": getattr(obj, "streaming", False)
        }
 class GraphRagResponseTranslator(MessageTranslator):
    """Translator for GraphRagResponse schema objects"""
-    
+
    def to_pulsar(self, data: Dict[str, Any]) -> GraphRagResponse:
        raise NotImplementedError("Response translation to Pulsar not typically needed")
-    
+
    def from_pulsar(self, obj: GraphRagResponse) -> Dict[str, Any]:
-        return {
+        result = {}
-            "response": obj.response
+
-        }
+        # Check if this is a streaming response (has chunk)
-    
+        if hasattr(obj, 'chunk') and obj.chunk:
            result["chunk"] = obj.chunk
            result["end_of_stream"] = getattr(obj, "end_of_stream", False)
        else:
            # Non-streaming response
            if obj.response:
                result["response"] = obj.response
        # Always include error if present
        if hasattr(obj, 'error') and obj.error and obj.error.message:
            result["error"] = {"message": obj.error.message, "type": obj.error.type}
        return result
    def from_response_with_completion(self, obj: GraphRagResponse) -> Tuple[Dict[str, Any], bool]:
        """Returns (response_dict, is_final)"""
-        return self.from_pulsar(obj), True
+        # For streaming responses, check end_of_stream
        if hasattr(obj, 'chunk') and obj.chunk:
            is_final = getattr(obj, 'end_of_stream', False)
        else:
            # For non-streaming responses, it's always final
            is_final = True
        return self.from_pulsar(obj), is_final
--- a/trustgraph-base/trustgraph/messaging/translators/text_completion.py
+++ b/trustgraph-base/trustgraph/messaging/translators/text_completion.py
@ -5,11 +5,12 @@ from .base import MessageTranslator
 class TextCompletionRequestTranslator(MessageTranslator):
    """Translator for TextCompletionRequest schema objects"""
-    
+
    def to_pulsar(self, data: Dict[str, Any]) -> TextCompletionRequest:
        return TextCompletionRequest(
            system=data["system"],
-            prompt=data["prompt"]
+            prompt=data["prompt"],
            streaming=data.get("streaming", False)
        )
    def from_pulsar(self, obj: TextCompletionRequest) -> Dict[str, Any]:
@ -39,4 +40,6 @@ class TextCompletionResponseTranslator(MessageTranslator):
    def from_response_with_completion(self, obj: TextCompletionResponse) -> Tuple[Dict[str, Any], bool]:
        """Returns (response_dict, is_final)"""
-        return self.from_pulsar(obj), True
+        # Check end_of_stream field to determine if this is the final message
        is_final = getattr(obj, 'end_of_stream', True)
        return self.from_pulsar(obj), is_final
--- a/trustgraph-base/trustgraph/schema/services/agent.py
+++ b/trustgraph-base/trustgraph/schema/services/agent.py
@ -1,5 +1,5 @@
-from pulsar.schema import Record, String, Array, Map
+from pulsar.schema import Record, String, Array, Map, Boolean
 from ..core.topic import topic
 from ..core.primitives import Error
@ -21,8 +21,16 @@ class AgentRequest(Record):
    group = Array(String())
    history = Array(AgentStep())
    user = String()              # User context for multi-tenancy
    streaming = Boolean()        # NEW: Enable streaming response delivery (default false)
 class AgentResponse(Record):
    # Streaming-first design
    chunk_type = String()        # "thought", "action", "observation", "answer", "error"
    content = String()           # The actual content (interpretation depends on chunk_type)
    end_of_message = Boolean()   # Current chunk type (thought/action/etc.) is complete
    end_of_dialog = Boolean()    # Entire agent dialog is complete
    # Legacy fields (deprecated but kept for backward compatibility)
    answer = String()
    error = Error()
    thought = String()
--- a/trustgraph-base/trustgraph/schema/services/llm.py
+++ b/trustgraph-base/trustgraph/schema/services/llm.py
@ -1,5 +1,5 @@
-from pulsar.schema import Record, String, Array, Double, Integer
+from pulsar.schema import Record, String, Array, Double, Integer, Boolean
 from ..core.topic import topic
 from ..core.primitives import Error
@ -11,6 +11,7 @@ from ..core.primitives import Error
 class TextCompletionRequest(Record):
    system = String()
    prompt = String()
    streaming = Boolean()  # Default false for backward compatibility
 class TextCompletionResponse(Record):
    error = Error()
@ -18,6 +19,7 @@ class TextCompletionResponse(Record):
    in_token = Integer()
    out_token = Integer()
    model = String()
    end_of_stream = Boolean()  # Indicates final message in stream
 ############################################################################
--- a/trustgraph-base/trustgraph/schema/services/prompt.py
+++ b/trustgraph-base/trustgraph/schema/services/prompt.py
@ -1,4 +1,4 @@
-from pulsar.schema import Record, String, Map
+from pulsar.schema import Record, String, Map, Boolean
 from ..core.primitives import Error
 from ..core.topic import topic
@ -24,6 +24,9 @@ class PromptRequest(Record):
    # JSON encoded values
    terms = Map(String())
    # Streaming support (default false for backward compatibility)
    streaming = Boolean()
 class PromptResponse(Record):
    # Error case
@ -35,4 +38,7 @@ class PromptResponse(Record):
    # JSON encoded
    object = String()
    # Indicates final message in stream
    end_of_stream = Boolean()
 ############################################################################
--- a/trustgraph-base/trustgraph/schema/services/retrieval.py
+++ b/trustgraph-base/trustgraph/schema/services/retrieval.py
@ -15,10 +15,13 @@ class GraphRagQuery(Record):
    triple_limit = Integer()
    max_subgraph_size = Integer()
    max_path_length = Integer()
    streaming = Boolean()
 class GraphRagResponse(Record):
    error = Error()
    response = String()
    chunk = String()
    end_of_stream = Boolean()
 ############################################################################
@ -29,8 +32,11 @@ class DocumentRagQuery(Record):
    user = String()
    collection = String()
    doc_limit = Integer()
    streaming = Boolean()
 class DocumentRagResponse(Record):
    error = Error()
    response = String()
    chunk = String()
    end_of_stream = Boolean()
--- a/trustgraph-bedrock/pyproject.toml
+++ b/trustgraph-bedrock/pyproject.toml
@ -10,7 +10,7 @@ description = "TrustGraph provides a means to run a pipeline of flexible AI proc
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
    "pulsar-client",
    "prometheus-client",
    "boto3",
--- a/trustgraph-bedrock/trustgraph/model/text_completion/bedrock/llm.py
+++ b/trustgraph-bedrock/trustgraph/model/text_completion/bedrock/llm.py
@ -11,7 +11,7 @@ import enum
 import logging
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 # Module logger
 logger = logging.getLogger(__name__)
@ -21,8 +21,6 @@ default_ident = "text-completion"
 default_model = 'mistral.mistral-large-2407-v1:0'
 default_temperature = 0.0
 default_max_output = 2048
 default_top_p = 0.99
 default_top_k = 40
 # Actually, these could all just be None, no need to get environment
 # variables, as Boto3 would pick all these up if not passed in as args
@ -38,61 +36,60 @@ class ModelHandler:
    def __init__(self):
        self.temperature = default_temperature
        self.max_output = default_max_output
        self.top_p = default_top_p
        self.top_k = default_top_k
    def set_temperature(self, temperature):
        self.temperature = temperature
    def set_max_output(self, max_output):
        self.max_output = max_output
    def set_top_p(self, top_p):
        self.top_p = top_p
    def set_top_k(self, top_k):
        self.top_k = top_k
    def encode_request(self, system, prompt):
        raise RuntimeError("format_request not implemented")
    def decode_response(self, response):
        raise RuntimeError("format_request not implemented")
    def decode_stream_chunk(self, chunk):
        raise RuntimeError("decode_stream_chunk not implemented")
 class Mistral(ModelHandler):
    def __init__(self):
-        self.top_p = 0.99
+        pass
        self.top_k = 40
    def encode_request(self, system, prompt):
        return json.dumps({
            "prompt": f"{system}\n\n{prompt}",
            "max_tokens": self.max_output,
            "temperature": self.temperature,
            "top_p": self.top_p,
            "top_k": self.top_k,
        })
    def decode_response(self, response):
        response_body = json.loads(response.get("body").read())
        return response_body['outputs'][0]['text']
    def decode_stream_chunk(self, chunk):
        chunk_obj = json.loads(chunk.get('chunk').get('bytes').decode())
        if 'outputs' in chunk_obj and len(chunk_obj['outputs']) > 0:
            return chunk_obj['outputs'][0].get('text', '')
        return ''
 # Llama 3
 class Meta(ModelHandler):
    def __init__(self):
-        self.top_p = 0.95
+        pass
    def encode_request(self, system, prompt):
        return json.dumps({
            "prompt": f"{system}\n\n{prompt}",
            "max_gen_len": self.max_output,
            "temperature": self.temperature,
            "top_p": self.top_p,
        })
    def decode_response(self, response):
        model_response = json.loads(response["body"].read())
        return model_response["generation"]
    def decode_stream_chunk(self, chunk):
        chunk_obj = json.loads(chunk.get('chunk').get('bytes').decode())
        return chunk_obj.get('generation', '')
 class Anthropic(ModelHandler):
    def __init__(self):
-        self.top_p = 0.999
+        pass
    def encode_request(self, system, prompt):
        return json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": self.max_output,
            "temperature": self.temperature,
            "top_p": self.top_p,
            "messages": [
                {
                    "role": "user",
@ -108,15 +105,20 @@ class Anthropic(ModelHandler):
    def decode_response(self, response):
        model_response = json.loads(response["body"].read())
        return model_response['content'][0]['text']
    def decode_stream_chunk(self, chunk):
        chunk_obj = json.loads(chunk.get('chunk').get('bytes').decode())
        if chunk_obj.get('type') == 'content_block_delta':
            if 'delta' in chunk_obj and 'text' in chunk_obj['delta']:
                return chunk_obj['delta']['text']
        return ''
 class Ai21(ModelHandler):
    def __init__(self):
-        self.top_p = 0.9
+        pass
    def encode_request(self, system, prompt):
        return json.dumps({
            "max_tokens": self.max_output,
            "temperature": self.temperature,
            "top_p": self.top_p,
            "messages": [
                {
                    "role": "user",
@ -129,6 +131,12 @@ class Ai21(ModelHandler):
        content_str = content.decode('utf-8')
        content_json = json.loads(content_str)
        return content_json['choices'][0]['message']['content']
    def decode_stream_chunk(self, chunk):
        chunk_obj = json.loads(chunk.get('chunk').get('bytes').decode())
        if 'choices' in chunk_obj and len(chunk_obj['choices']) > 0:
            delta = chunk_obj['choices'][0].get('delta', {})
            return delta.get('content', '')
        return ''
 class Cohere(ModelHandler):
    def encode_request(self, system, prompt):
@ -142,6 +150,9 @@ class Cohere(ModelHandler):
        content_str = content.decode('utf-8')
        content_json = json.loads(content_str)
        return content_json['text']
    def decode_stream_chunk(self, chunk):
        chunk_obj = json.loads(chunk.get('chunk').get('bytes').decode())
        return chunk_obj.get('text', '')
 Default=Mistral
@ -205,30 +216,17 @@ class Processor(LlmService):
    def determine_variant(self, model):
-        # FIXME: Missing, Amazon models, Deepseek
+        if ".anthropic." in model or model.startswith("anthropic"):
-
+            return Anthropic
-        # This set of conditions deals with normal bedrock on-demand usage
+        elif ".meta." in model or model.startswith("meta"):
-        if model.startswith("mistral"):
+            return Meta
        elif ".mistral." in model or model.startswith("mistral"):
            return Mistral
-        elif model.startswith("meta"):
+        elif ".ai21." in model or model.startswith("ai21"):
            return Meta
        elif model.startswith("anthropic"):
            return Anthropic
        elif model.startswith("ai21"):
            return Ai21
-        elif model.startswith("cohere"):
+        elif ".cohere." in model or model.startswith("cohere"):
            return Cohere
        # The inference profiles
        if model.startswith("us.meta"):
            return Meta
        elif model.startswith("us.anthropic"):
            return Anthropic
        elif model.startswith("eu.meta"):
            return Meta
        elif model.startswith("eu.anthropic"):
            return Anthropic
        return Default
    def _get_or_create_variant(self, model_name, temperature=None):
@ -309,6 +307,78 @@ class Processor(LlmService):
            logger.error(f"Bedrock LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """Bedrock supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Bedrock"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        try:
            variant = self._get_or_create_variant(model_name, effective_temperature)
            promptbody = variant.encode_request(system, prompt)
            accept = 'application/json'
            contentType = 'application/json'
            response = self.bedrock.invoke_model_with_response_stream(
                body=promptbody,
                modelId=model_name,
                accept=accept,
                contentType=contentType
            )
            total_input_tokens = 0
            total_output_tokens = 0
            stream = response.get('body')
            if stream:
                for event in stream:
                    chunk = event.get('chunk')
                    if chunk:
                        # Decode the chunk text
                        text = variant.decode_stream_chunk(event)
                        if text:
                            yield LlmChunk(
                                text=text,
                                in_token=None,
                                out_token=None,
                                model=model_name,
                                is_final=False
                            )
                    # Try to extract metadata from the event
                    metadata = event.get('metadata')
                    if metadata:
                        usage = metadata.get('usage')
                        if usage:
                            total_input_tokens = usage.get('inputTokens', 0)
                            total_output_tokens = usage.get('outputTokens', 0)
            # Send final chunk with token counts
            yield LlmChunk(
                text="",
                in_token=total_input_tokens,
                out_token=total_output_tokens,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except self.bedrock.exceptions.ThrottlingException as e:
            logger.warning(f"Hit rate limit during streaming: {e}")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"Bedrock streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-cli/pyproject.toml
+++ b/trustgraph-cli/pyproject.toml
@ -10,7 +10,7 @@ description = "TrustGraph provides a means to run a pipeline of flexible AI proc
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
    "requests",
    "pulsar-client",
    "aiohttp",
@ -34,6 +34,7 @@ tg-delete-mcp-tool = "trustgraph.cli.delete_mcp_tool:main"
 tg-delete-kg-core = "trustgraph.cli.delete_kg_core:main"
 tg-delete-tool = "trustgraph.cli.delete_tool:main"
 tg-dump-msgpack = "trustgraph.cli.dump_msgpack:main"
 tg-dump-queues = "trustgraph.cli.dump_queues:main"
 tg-get-flow-class = "trustgraph.cli.get_flow_class:main"
 tg-get-kg-core = "trustgraph.cli.get_kg_core:main"
 tg-graph-to-turtle = "trustgraph.cli.graph_to_turtle:main"
--- a/trustgraph-cli/trustgraph/cli/dump_queues.py
+++ b/trustgraph-cli/trustgraph/cli/dump_queues.py
@ -0,0 +1,362 @@
 """
 Multi-queue Pulsar message dumper for debugging TrustGraph message flows.
 This utility monitors multiple Pulsar queues simultaneously and logs all messages
 to a file with timestamps and pretty-printed formatting. Useful for debugging
 message flows, diagnosing stuck services, and understanding system behavior.
 Uses TrustGraph's Subscriber abstraction for future-proof pub/sub compatibility.
 """
 import pulsar
 from pulsar.schema import BytesSchema
 import sys
 import json
 import asyncio
 from datetime import datetime
 import argparse
 from trustgraph.base.subscriber import Subscriber
 def format_message(queue_name, msg):
    """Format a message with timestamp and queue name."""
    timestamp = datetime.now().isoformat()
    # Try to parse as JSON and pretty-print
    try:
        # Handle both Message objects and raw bytes
        if hasattr(msg, 'value'):
            # Message object with .value() method
            value = msg.value()
        else:
            # Raw bytes from schema-less subscription
            value = msg
        # If it's bytes, decode it
        if isinstance(value, bytes):
            value = value.decode('utf-8')
        # If it's a string, try to parse as JSON
        if isinstance(value, str):
            try:
                parsed = json.loads(value)
                body = json.dumps(parsed, indent=2)
            except (json.JSONDecodeError, TypeError):
                body = value
        else:
            # Try to convert to dict for pretty printing
            try:
                # Pulsar schema objects have __dict__ or similar
                if hasattr(value, '__dict__'):
                    parsed = {k: v for k, v in value.__dict__.items()
                             if not k.startswith('_')}
                else:
                    parsed = str(value)
                body = json.dumps(parsed, indent=2, default=str)
            except (TypeError, AttributeError):
                body = str(value)
    except Exception as e:
        body = f"<Error formatting message: {e}>\n{str(msg)}"
    # Format the output
    header = f"\n{'='*80}\n[{timestamp}] Queue: {queue_name}\n{'='*80}\n"
    return header + body + "\n"
 async def monitor_queue(subscriber, queue_name, central_queue, monitor_id, shutdown_event):
    """
    Monitor a single queue via Subscriber and forward messages to central queue.
    Args:
        subscriber: Subscriber instance for this queue
        queue_name: Name of the queue (for logging)
        central_queue: asyncio.Queue to forward messages to
        monitor_id: Unique ID for this monitor's subscription
        shutdown_event: asyncio.Event to signal shutdown
    """
    msg_queue = None
    try:
        # Subscribe to all messages from this Subscriber
        msg_queue = await subscriber.subscribe_all(monitor_id)
        while not shutdown_event.is_set():
            try:
                # Read from Subscriber's internal queue with timeout
                msg = await asyncio.wait_for(msg_queue.get(), timeout=0.5)
                timestamp = datetime.now()
                formatted = format_message(queue_name, msg)
                # Forward to central queue for writing
                await central_queue.put((timestamp, queue_name, formatted))
            except asyncio.TimeoutError:
                # No message, check shutdown flag again
                continue
    except Exception as e:
        if not shutdown_event.is_set():
            error_msg = f"\n{'='*80}\n[{datetime.now().isoformat()}] ERROR in monitor for {queue_name}\n{'='*80}\n{e}\n"
            await central_queue.put((datetime.now(), queue_name, error_msg))
    finally:
        # Clean unsubscribe
        if msg_queue is not None:
            try:
                await subscriber.unsubscribe_all(monitor_id)
            except Exception:
                pass
 async def log_writer(central_queue, file_handle, shutdown_event, console_output=True):
    """
    Write messages from central queue to file.
    Args:
        central_queue: asyncio.Queue containing (timestamp, queue_name, formatted_msg) tuples
        file_handle: Open file handle to write to
        shutdown_event: asyncio.Event to signal shutdown
        console_output: Whether to print abbreviated messages to console
    """
    try:
        while not shutdown_event.is_set():
            try:
                # Wait for messages with timeout to check shutdown flag
                timestamp, queue_name, formatted_msg = await asyncio.wait_for(
                    central_queue.get(), timeout=0.5
                )
                # Write to file
                file_handle.write(formatted_msg)
                file_handle.flush()
                # Print abbreviated message to console
                if console_output:
                    time_str = timestamp.strftime('%H:%M:%S')
                    print(f"[{time_str}] {queue_name}: Message received")
            except asyncio.TimeoutError:
                # No message, check shutdown flag again
                continue
    finally:
        # Flush remaining messages after shutdown
        while not central_queue.empty():
            try:
                timestamp, queue_name, formatted_msg = central_queue.get_nowait()
                file_handle.write(formatted_msg)
                file_handle.flush()
            except asyncio.QueueEmpty:
                break
 async def async_main(queues, output_file, pulsar_host, listener_name, subscriber_name, append_mode):
    """
    Main async function to monitor multiple queues concurrently.
    Args:
        queues: List of queue names to monitor
        output_file: Path to output file
        pulsar_host: Pulsar connection URL
        listener_name: Pulsar listener name
        subscriber_name: Base name for subscribers
        append_mode: Whether to append to existing file
    """
    print(f"TrustGraph Queue Dumper")
    print(f"Monitoring {len(queues)} queue(s):")
    for q in queues:
        print(f"  - {q}")
    print(f"Output file: {output_file}")
    print(f"Mode: {'append' if append_mode else 'overwrite'}")
    print(f"Press Ctrl+C to stop\n")
    # Connect to Pulsar
    try:
        client = pulsar.Client(pulsar_host, listener_name=listener_name)
    except Exception as e:
        print(f"Error connecting to Pulsar at {pulsar_host}: {e}", file=sys.stderr)
        sys.exit(1)
    # Create Subscribers and central queue
    central_queue = asyncio.Queue()
    subscribers = []
    for queue_name in queues:
        try:
            sub = Subscriber(
                client=client,
                topic=queue_name,
                subscription=subscriber_name,
                consumer_name=f"{subscriber_name}-{queue_name}",
                schema=None,  # No schema - accept any message type
            )
            await sub.start()
            subscribers.append((queue_name, sub))
            print(f"✓ Subscribed to: {queue_name}")
        except Exception as e:
            print(f"✗ Error subscribing to {queue_name}: {e}", file=sys.stderr)
    if not subscribers:
        print("\nNo subscribers created. Exiting.", file=sys.stderr)
        client.close()
        sys.exit(1)
    print(f"\nListening for messages...\n")
    # Open output file
    mode = 'a' if append_mode else 'w'
    try:
        with open(output_file, mode) as f:
            f.write(f"\n{'#'*80}\n")
            f.write(f"# Session started: {datetime.now().isoformat()}\n")
            f.write(f"# Monitoring queues: {', '.join(queues)}\n")
            f.write(f"{'#'*80}\n")
            f.flush()
            # Create shutdown event for clean coordination
            shutdown_event = asyncio.Event()
            # Start monitoring tasks
            tasks = []
            try:
                # Create one monitor task per subscriber
                for queue_name, sub in subscribers:
                    task = asyncio.create_task(
                        monitor_queue(sub, queue_name, central_queue, "logger", shutdown_event)
                    )
                    tasks.append(task)
                # Create single writer task
                writer_task = asyncio.create_task(
                    log_writer(central_queue, f, shutdown_event)
                )
                tasks.append(writer_task)
                # Wait for all tasks (they check shutdown_event)
                await asyncio.gather(*tasks)
            except KeyboardInterrupt:
                print("\n\nStopping...")
            finally:
                # Signal shutdown to all tasks
                shutdown_event.set()
                # Wait for tasks to finish cleanly (with timeout)
                try:
                    await asyncio.wait_for(asyncio.gather(*tasks, return_exceptions=True), timeout=2.0)
                except asyncio.TimeoutError:
                    print("Warning: Shutdown timeout", file=sys.stderr)
                # Write session end marker
                f.write(f"\n{'#'*80}\n")
                f.write(f"# Session ended: {datetime.now().isoformat()}\n")
                f.write(f"{'#'*80}\n")
    except IOError as e:
        print(f"Error writing to {output_file}: {e}", file=sys.stderr)
        sys.exit(1)
    finally:
        # Clean shutdown of Subscribers
        for _, sub in subscribers:
            await sub.stop()
        client.close()
    print(f"\nMessages logged to: {output_file}")
 def main():
    parser = argparse.ArgumentParser(
        prog='tg-dump-queues',
        description='Monitor and dump messages from multiple Pulsar queues',
        epilog="""
 Examples:
  # Monitor agent and prompt queues
  tg-dump-queues non-persistent://tg/request/agent:default \\
                 non-persistent://tg/request/prompt:default
  # Monitor with custom output file
  tg-dump-queues non-persistent://tg/request/agent:default \\
                 --output debug.log
  # Append to existing log file
  tg-dump-queues non-persistent://tg/request/agent:default \\
                 --output queue.log --append
 Common queue patterns:
  - Agent requests:   non-persistent://tg/request/agent:default
  - Agent responses:  non-persistent://tg/response/agent:default
  - Prompt requests:  non-persistent://tg/request/prompt:default
  - Prompt responses: non-persistent://tg/response/prompt:default
  - LLM requests:     non-persistent://tg/request/text-completion:default
  - LLM responses:    non-persistent://tg/response/text-completion:default
 IMPORTANT:
  This tool subscribes to queues without a schema (schema-less mode). To avoid
  schema conflicts, ensure that TrustGraph services and flows are already started
  before running this tool. If this tool subscribes first, the real services may
  encounter schema mismatch errors when they try to connect.
  Best practice: Start services → Set up flows → Run tg-dump-queues
        """,
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        'queues',
        nargs='+',
        help='Pulsar queue names to monitor'
    )
    parser.add_argument(
        '--output', '-o',
        default='queue.log',
        help='Output file (default: queue.log)'
    )
    parser.add_argument(
        '--append', '-a',
        action='store_true',
        help='Append to output file instead of overwriting'
    )
    parser.add_argument(
        '--pulsar-host',
        default='pulsar://localhost:6650',
        help='Pulsar host URL (default: pulsar://localhost:6650)'
    )
    parser.add_argument(
        '--listener-name',
        default='localhost',
        help='Pulsar listener name (default: localhost)'
    )
    parser.add_argument(
        '--subscriber',
        default='debug',
        help='Subscriber name for queue subscription (default: debug)'
    )
    args = parser.parse_args()
    # Filter out any accidentally included flags
    queues = [q for q in args.queues if not q.startswith('--')]
    if not queues:
        parser.error("No queues specified")
    # Run async main
    try:
        asyncio.run(async_main(
            queues=queues,
            output_file=args.output,
            pulsar_host=args.pulsar_host,
            listener_name=args.listener_name,
            subscriber_name=args.subscriber,
            append_mode=args.append
        ))
    except KeyboardInterrupt:
        # Already handled in async_main
        pass
    except Exception as e:
        print(f"Fatal error: {e}", file=sys.stderr)
        sys.exit(1)
 if __name__ == '__main__':
    main()
--- a/trustgraph-cli/trustgraph/cli/invoke_agent.py
+++ b/trustgraph-cli/trustgraph/cli/invoke_agent.py
@ -14,6 +14,78 @@ default_url = os.getenv("TRUSTGRAPH_URL", 'ws://localhost:8088/')
 default_user = 'trustgraph'
 default_collection = 'default'
 class Outputter:
    def __init__(self, width=75, prefix="> "):
        self.width = width
        self.prefix = prefix
        self.column = 0
        self.word_buffer = ""
        self.just_wrapped = False
    def __enter__(self):
        # Print prefix at start of first line
        print(self.prefix, end="", flush=True)
        self.column = len(self.prefix)
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        # Flush remaining word buffer
        if self.word_buffer:
            print(self.word_buffer, end="", flush=True)
            self.column += len(self.word_buffer)
            self.word_buffer = ""
        # Add final newline if not at line start
        if self.column > 0:
            print(flush=True)
            self.column = 0
    def output(self, text):
        for char in text:
            # Handle whitespace (space/tab)
            if char in (' ', '\t'):
                # Flush word buffer if present
                if self.word_buffer:
                    # Check if word + space would exceed width
                    if self.column + len(self.word_buffer) + 1 > self.width:
                        # Wrap: newline + prefix
                        print(flush=True)
                        print(self.prefix, end="", flush=True)
                        self.column = len(self.prefix)
                        self.just_wrapped = True
                    # Output word buffer
                    print(self.word_buffer, end="", flush=True)
                    self.column += len(self.word_buffer)
                    self.word_buffer = ""
                # Output the space
                print(char, end="", flush=True)
                self.column += 1
                self.just_wrapped = False
            # Handle newline
            elif char == '\n':
                if self.just_wrapped:
                    # Skip this newline (already wrapped)
                    self.just_wrapped = False
                else:
                    # Flush word buffer if any
                    if self.word_buffer:
                        print(self.word_buffer, end="", flush=True)
                        self.word_buffer = ""
                    # Output newline + prefix
                    print(flush=True)
                    print(self.prefix, end="", flush=True)
                    self.column = len(self.prefix)
                    self.just_wrapped = False
            # Regular character - add to word buffer
            else:
                self.word_buffer += char
                self.just_wrapped = False
 def wrap(text, width=75):
    if text is None: text = "n/a"
    out = textwrap.wrap(
@ -29,7 +101,7 @@ def output(text, prefix="> ", width=78):
 async def question(
        url, question, flow_id, user, collection,
-        plan=None, state=None, group=None, verbose=False
+        plan=None, state=None, group=None, verbose=False, streaming=True
 ):
    if not url.endswith("/"):
@ -41,6 +113,10 @@ async def question(
        output(wrap(question), "\U00002753 ")
        print()
    # Track last chunk type and current outputter for streaming
    last_chunk_type = None
    current_outputter = None
    def think(x):
        if verbose:
            output(wrap(x), "\U0001f914 ")
@ -62,16 +138,17 @@ async def question(
            "request": {
                "question": question,
                "user": user,
-                "history": []
+                "history": [],
                "streaming": streaming
            }
        }
-        
+
        # Only add optional fields if they have values
        if state is not None:
            req["request"]["state"] = state
        if group is not None:
            req["request"]["group"] = group
-            
+
        req = json.dumps(req)
        await ws.send(req)
@ -89,16 +166,60 @@ async def question(
                print("Ignore message")
                continue
-            if "thought" in obj["response"]:
+            response = obj["response"]
                think(obj["response"]["thought"])
-            if "observation" in obj["response"]:
+            # Handle streaming format (new format with chunk_type)
-                observe(obj["response"]["observation"])
+            if "chunk_type" in response:
                chunk_type = response["chunk_type"]
                content = response.get("content", "")
-            if "answer" in obj["response"]:
+                # Check if we're switching to a new message type
-                print(obj["response"]["answer"])
+                if last_chunk_type != chunk_type:
                    # Close previous outputter if exists
                    if current_outputter:
                        current_outputter.__exit__(None, None, None)
                        current_outputter = None
                        print()  # Blank line between message types
-            if obj["complete"]: break
+                    # Create new outputter for new message type
                    if chunk_type == "thought" and verbose:
                        current_outputter = Outputter(width=78, prefix="\U0001f914  ")
                        current_outputter.__enter__()
                    elif chunk_type == "observation" and verbose:
                        current_outputter = Outputter(width=78, prefix="\U0001f4a1  ")
                        current_outputter.__enter__()
                    # For answer, don't use Outputter - just print as-is
                    last_chunk_type = chunk_type
                # Output the chunk
                if current_outputter:
                    current_outputter.output(content)
                elif chunk_type == "answer":
                    print(content, end="", flush=True)
            else:
                # Handle legacy format (backward compatibility)
                if "thought" in response:
                    think(response["thought"])
                if "observation" in response:
                    observe(response["observation"])
                if "answer" in response:
                    print(response["answer"])
                if "error" in response:
                    raise RuntimeError(response["error"])
            if obj["complete"]:
                # Close any remaining outputter
                if current_outputter:
                    current_outputter.__exit__(None, None, None)
                    current_outputter = None
                # Add final newline if we were outputting answer
                elif last_chunk_type == "answer":
                    print()
                break
        await ws.close()
@ -161,6 +282,12 @@ def main():
        help=f'Output thinking/observations'
    )
    parser.add_argument(
        '--no-streaming',
        action="store_true",
        help=f'Disable streaming (use legacy mode)'
    )
    args = parser.parse_args()
    try:
@ -176,6 +303,7 @@ def main():
                state = args.state,
                group = args.group,
                verbose = args.verbose,
                streaming = not args.no_streaming,
            )
        )
@ -184,4 +312,4 @@ def main():
        print("Exception:", e, flush=True)
 if __name__ == "__main__":
-    main()
+    main()
--- a/trustgraph-cli/trustgraph/cli/invoke_document_rag.py
+++ b/trustgraph-cli/trustgraph/cli/invoke_document_rag.py
@ -4,6 +4,10 @@ Uses the DocumentRAG service to answer a question
 import argparse
 import os
 import asyncio
 import json
 import uuid
 from websockets.asyncio.client import connect
 from trustgraph.api import Api
 default_url = os.getenv("TRUSTGRAPH_URL", 'http://localhost:8088/')
@ -11,7 +15,69 @@ default_user = 'trustgraph'
 default_collection = 'default'
 default_doc_limit = 10
-def question(url, flow_id, question, user, collection, doc_limit):
+async def question_streaming(url, flow_id, question, user, collection, doc_limit):
    """Streaming version using websockets"""
    # Convert http:// to ws://
    if url.startswith('http://'):
        url = 'ws://' + url[7:]
    elif url.startswith('https://'):
        url = 'wss://' + url[8:]
    if not url.endswith("/"):
        url += "/"
    url = url + "api/v1/socket"
    mid = str(uuid.uuid4())
    async with connect(url) as ws:
        req = {
            "id": mid,
            "service": "document-rag",
            "flow": flow_id,
            "request": {
                "query": question,
                "user": user,
                "collection": collection,
                "doc-limit": doc_limit,
                "streaming": True
            }
        }
        req = json.dumps(req)
        await ws.send(req)
        while True:
            msg = await ws.recv()
            obj = json.loads(msg)
            if "error" in obj:
                raise RuntimeError(obj["error"])
            if obj["id"] != mid:
                print("Ignore message")
                continue
            response = obj["response"]
            # Handle streaming format (chunk)
            if "chunk" in response:
                chunk = response["chunk"]
                print(chunk, end="", flush=True)
            elif "response" in response:
                # Final response with complete text
                # Already printed via chunks, just add newline
                pass
            if obj["complete"]:
                print()  # Final newline
                break
        await ws.close()
 def question_non_streaming(url, flow_id, question, user, collection, doc_limit):
    """Non-streaming version using HTTP API"""
    api = Api(url).flow().id(flow_id)
@ -65,18 +131,36 @@ def main():
        help=f'Document limit (default: {default_doc_limit})'
    )
    parser.add_argument(
        '--no-streaming',
        action='store_true',
        help='Disable streaming (use non-streaming mode)'
    )
    args = parser.parse_args()
    try:
-        question(
+        if not args.no_streaming:
-            url=args.url,
+            asyncio.run(
-            flow_id = args.flow_id,
+                question_streaming(
-            question=args.question,
+                    url=args.url,
-            user=args.user,
+                    flow_id=args.flow_id,
-            collection=args.collection,
+                    question=args.question,
-            doc_limit=args.doc_limit,
+                    user=args.user,
-        )
+                    collection=args.collection,
                    doc_limit=args.doc_limit,
                )
            )
        else:
            question_non_streaming(
                url=args.url,
                flow_id=args.flow_id,
                question=args.question,
                user=args.user,
                collection=args.collection,
                doc_limit=args.doc_limit,
            )
    except Exception as e:
--- a/trustgraph-cli/trustgraph/cli/invoke_graph_rag.py
+++ b/trustgraph-cli/trustgraph/cli/invoke_graph_rag.py
@ -4,6 +4,10 @@ Uses the GraphRAG service to answer a question
 import argparse
 import os
 import asyncio
 import json
 import uuid
 from websockets.asyncio.client import connect
 from trustgraph.api import Api
 default_url = os.getenv("TRUSTGRAPH_URL", 'http://localhost:8088/')
@ -14,10 +18,78 @@ default_triple_limit = 30
 default_max_subgraph_size = 150
 default_max_path_length = 2
-def question(
+async def question_streaming(
        url, flow_id, question, user, collection, entity_limit, triple_limit,
        max_subgraph_size, max_path_length
 ):
    """Streaming version using websockets"""
    # Convert http:// to ws://
    if url.startswith('http://'):
        url = 'ws://' + url[7:]
    elif url.startswith('https://'):
        url = 'wss://' + url[8:]
    if not url.endswith("/"):
        url += "/"
    url = url + "api/v1/socket"
    mid = str(uuid.uuid4())
    async with connect(url) as ws:
        req = {
            "id": mid,
            "service": "graph-rag",
            "flow": flow_id,
            "request": {
                "query": question,
                "user": user,
                "collection": collection,
                "entity-limit": entity_limit,
                "triple-limit": triple_limit,
                "max-subgraph-size": max_subgraph_size,
                "max-path-length": max_path_length,
                "streaming": True
            }
        }
        req = json.dumps(req)
        await ws.send(req)
        while True:
            msg = await ws.recv()
            obj = json.loads(msg)
            if "error" in obj:
                raise RuntimeError(obj["error"])
            if obj["id"] != mid:
                print("Ignore message")
                continue
            response = obj["response"]
            # Handle streaming format (chunk)
            if "chunk" in response:
                chunk = response["chunk"]
                print(chunk, end="", flush=True)
            elif "response" in response:
                # Final response with complete text
                # Already printed via chunks, just add newline
                pass
            if obj["complete"]:
                print()  # Final newline
                break
        await ws.close()
 def question_non_streaming(
        url, flow_id, question, user, collection, entity_limit, triple_limit,
        max_subgraph_size, max_path_length
 ):
    """Non-streaming version using HTTP API"""
    api = Api(url).flow().id(flow_id)
@ -91,21 +163,42 @@ def main():
        help=f'Max path length (default: {default_max_path_length})'
    )
    parser.add_argument(
        '--no-streaming',
        action='store_true',
        help='Disable streaming (use non-streaming mode)'
    )
    args = parser.parse_args()
    try:
-        question(
+        if not args.no_streaming:
-            url=args.url,
+            asyncio.run(
-            flow_id = args.flow_id,
+                question_streaming(
-            question=args.question,
+                    url=args.url,
-            user=args.user,
+                    flow_id=args.flow_id,
-            collection=args.collection,
+                    question=args.question,
-            entity_limit=args.entity_limit,
+                    user=args.user,
-            triple_limit=args.triple_limit,
+                    collection=args.collection,
-            max_subgraph_size=args.max_subgraph_size,
+                    entity_limit=args.entity_limit,
-            max_path_length=args.max_path_length,
+                    triple_limit=args.triple_limit,
-        )
+                    max_subgraph_size=args.max_subgraph_size,
                    max_path_length=args.max_path_length,
                )
            )
        else:
            question_non_streaming(
                url=args.url,
                flow_id=args.flow_id,
                question=args.question,
                user=args.user,
                collection=args.collection,
                entity_limit=args.entity_limit,
                triple_limit=args.triple_limit,
                max_subgraph_size=args.max_subgraph_size,
                max_path_length=args.max_path_length,
            )
    except Exception as e:
--- a/trustgraph-cli/trustgraph/cli/invoke_llm.py
+++ b/trustgraph-cli/trustgraph/cli/invoke_llm.py
@ -6,17 +6,63 @@ and user prompt.  Both arguments are required.
 import argparse
 import os
 import json
-from trustgraph.api import Api
+import uuid
 import asyncio
 from websockets.asyncio.client import connect
-default_url = os.getenv("TRUSTGRAPH_URL", 'http://localhost:8088/')
+default_url = os.getenv("TRUSTGRAPH_URL", 'ws://localhost:8088/')
-def query(url, flow_id, system, prompt):
+async def query(url, flow_id, system, prompt, streaming=True):
-    api = Api(url).flow().id(flow_id)
+    if not url.endswith("/"):
        url += "/"
-    resp = api.text_completion(system=system, prompt=prompt)
+    url = url + "api/v1/socket"
-    print(resp)
+    mid = str(uuid.uuid4())
    async with connect(url) as ws:
        req = {
            "id": mid,
            "service": "text-completion",
            "flow": flow_id,
            "request": {
                "system": system,
                "prompt": prompt,
                "streaming": streaming
            }
        }
        await ws.send(json.dumps(req))
        while True:
            msg = await ws.recv()
            obj = json.loads(msg)
            if "error" in obj:
                raise RuntimeError(obj["error"])
            if obj["id"] != mid:
                continue
            if "response" in obj["response"]:
                if streaming:
                    # Stream output to stdout without newline
                    print(obj["response"]["response"], end="", flush=True)
                else:
                    # Non-streaming: print complete response
                    print(obj["response"]["response"])
            if obj["complete"]:
                if streaming:
                    # Add final newline after streaming
                    print()
                break
        await ws.close()
 def main():
@ -49,16 +95,23 @@ def main():
        help=f'Flow ID (default: default)'
    )
    parser.add_argument(
        '--no-streaming',
        action='store_true',
        help='Disable streaming (default: streaming enabled)'
    )
    args = parser.parse_args()
    try:
-        query(
+        asyncio.run(query(
            url=args.url,
-            flow_id = args.flow_id,
+            flow_id=args.flow_id,
            system=args.system[0],
            prompt=args.prompt[0],
-        )
+            streaming=not args.no_streaming
        ))
    except Exception as e:
--- a/trustgraph-cli/trustgraph/cli/invoke_prompt.py
+++ b/trustgraph-cli/trustgraph/cli/invoke_prompt.py
@ -10,20 +10,76 @@ using key=value arguments on the command line, and these replace
 import argparse
 import os
 import json
-from trustgraph.api import Api
+import uuid
 import asyncio
 from websockets.asyncio.client import connect
-default_url = os.getenv("TRUSTGRAPH_URL", 'http://localhost:8088/')
+default_url = os.getenv("TRUSTGRAPH_URL", 'ws://localhost:8088/')
-def query(url, flow_id, template_id, variables):
+async def query(url, flow_id, template_id, variables, streaming=True):
-    api = Api(url).flow().id(flow_id)
+    if not url.endswith("/"):
        url += "/"
-    resp = api.prompt(id=template_id, variables=variables)
+    url = url + "api/v1/socket"
-    if isinstance(resp, str):
+    mid = str(uuid.uuid4())
-        print(resp)
+
-    else:
+    async with connect(url) as ws:
-        print(json.dumps(resp, indent=4))
+
        req = {
            "id": mid,
            "service": "prompt",
            "flow": flow_id,
            "request": {
                "id": template_id,
                "variables": variables,
                "streaming": streaming
            }
        }
        await ws.send(json.dumps(req))
        full_response = {"text": "", "object": ""}
        while True:
            msg = await ws.recv()
            obj = json.loads(msg)
            if "error" in obj:
                raise RuntimeError(obj["error"])
            if obj["id"] != mid:
                continue
            response = obj["response"]
            # Handle text responses (streaming)
            if "text" in response and response["text"]:
                if streaming:
                    # Stream output to stdout without newline
                    print(response["text"], end="", flush=True)
                    full_response["text"] += response["text"]
                else:
                    # Non-streaming: print complete response
                    print(response["text"])
            # Handle object responses (JSON, never streamed)
            if "object" in response and response["object"]:
                full_response["object"] = response["object"]
            if obj["complete"]:
                if streaming and full_response["text"]:
                    # Add final newline after streaming text
                    print()
                elif full_response["object"]:
                    # Print JSON object (pretty-printed)
                    print(json.dumps(json.loads(full_response["object"]), indent=4))
                break
        await ws.close()
 def main():
@ -59,6 +115,12 @@ def main():
 specified multiple times''',
    )
    parser.add_argument(
        '--no-streaming',
        action='store_true',
        help='Disable streaming (default: streaming enabled for text responses)'
    )
    args = parser.parse_args()
    variables = {}
@ -73,12 +135,13 @@ specified multiple times''',
    try:
-        query(
+        asyncio.run(query(
            url=args.url,
            flow_id=args.flow_id,
            template_id=args.id[0],
            variables=variables,
-        )
+            streaming=not args.no_streaming
        ))
    except Exception as e:
--- a/trustgraph-embeddings-hf/pyproject.toml
+++ b/trustgraph-embeddings-hf/pyproject.toml
@ -10,8 +10,8 @@ description = "HuggingFace embeddings support for TrustGraph."
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
-    "trustgraph-flow>=1.5,<1.6",
+    "trustgraph-flow>=1.6,<1.7",
    "torch",
    "urllib3",
    "transformers",
--- a/trustgraph-flow/pyproject.toml
+++ b/trustgraph-flow/pyproject.toml
@ -10,7 +10,7 @@ description = "TrustGraph provides a means to run a pipeline of flexible AI proc
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
    "aiohttp",
    "anthropic",
    "scylla-driver",
--- a/trustgraph-flow/trustgraph/agent/react/agent_manager.py
+++ b/trustgraph-flow/trustgraph/agent/react/agent_manager.py
@ -2,6 +2,7 @@
 import logging
 import json
 import re
 import asyncio
 from . types import Action, Final
@ -169,7 +170,7 @@ class AgentManager:
        raise ValueError(f"Could not parse response: {text}")
-    async def reason(self, question, history, context):
+    async def reason(self, question, history, context, streaming=False, think=None, observe=None, answer=None):
        logger.debug(f"calling reason: {question}")
@ -219,25 +220,113 @@ class AgentManager:
        logger.info(f"prompt: {variables}")
-        # Get text response from prompt service
+        logger.info(f"DEBUG: streaming={streaming}, think={think is not None}")
        response_text = await context("prompt-request").agent_react(variables)
-        logger.debug(f"Response text:\n{response_text}")
+        # Streaming path - use StreamingReActParser
        if streaming and think:
            logger.info("DEBUG: Entering streaming path")
            from .streaming_parser import StreamingReActParser
-        logger.info(f"response: {response_text}")
+            logger.info("DEBUG: Creating StreamingReActParser")
            # Collect chunks to send via async callbacks
            thought_chunks = []
            answer_chunks = []
            # Create parser with synchronous callbacks that just collect chunks
            parser = StreamingReActParser(
                on_thought_chunk=lambda chunk: thought_chunks.append(chunk),
                on_answer_chunk=lambda chunk: answer_chunks.append(chunk),
            )
            logger.info("DEBUG: StreamingReActParser created")
            # Create async chunk callback that feeds parser and sends collected chunks
            async def on_chunk(text):
                logger.info(f"DEBUG: on_chunk called with {len(text)} chars")
                # Track what we had before
                prev_thought_count = len(thought_chunks)
                prev_answer_count = len(answer_chunks)
                # Feed the parser (synchronous)
                logger.info(f"DEBUG: About to call parser.feed")
                parser.feed(text)
                logger.info(f"DEBUG: parser.feed returned")
                # Send any new thought chunks
                for i in range(prev_thought_count, len(thought_chunks)):
                    logger.info(f"DEBUG: Sending thought chunk {i}")
                    # Mark last chunk as final if parser has moved out of THOUGHT state
                    is_last = (i == len(thought_chunks) - 1)
                    is_thought_complete = parser.state.value != "thought"
                    is_final = is_last and is_thought_complete
                    await think(thought_chunks[i], is_final=is_final)
                # Send any new answer chunks
                for i in range(prev_answer_count, len(answer_chunks)):
                    logger.info(f"DEBUG: Sending answer chunk {i}")
                    if answer:
                        await answer(answer_chunks[i])
                    else:
                        await think(answer_chunks[i])
            logger.info("DEBUG: Getting prompt-request client from context")
            client = context("prompt-request")
            logger.info(f"DEBUG: Got client: {client}")
            logger.info("DEBUG: About to call agent_react with streaming=True")
            # Get streaming response
            response_text = await client.agent_react(
                variables=variables,
                streaming=True,
                chunk_callback=on_chunk
            )
            logger.info(f"DEBUG: agent_react returned, got {len(response_text) if response_text else 0} chars")
            # Finalize parser
            logger.info("DEBUG: Finalizing parser")
            parser.finalize()
            logger.info("DEBUG: Parser finalized")
            # Get result
            logger.info("DEBUG: Getting result from parser")
            result = parser.get_result()
            if result is None:
                raise RuntimeError("Parser failed to produce a result")
        # Parse the text response
        try:
            result = self.parse_react_response(response_text)
            logger.info(f"Parsed result: {result}")
            return result
        except ValueError as e:
            logger.error(f"Failed to parse response: {e}")
            # Try to provide a helpful error message
            logger.error(f"Response was: {response_text}")
            raise RuntimeError(f"Failed to parse agent response: {e}")
-    async def react(self, question, history, think, observe, context):
+        else:
            logger.info("DEBUG: Entering NON-streaming path")
            # Non-streaming path - get complete text and parse
            logger.info("DEBUG: Getting prompt-request client from context")
            client = context("prompt-request")
            logger.info(f"DEBUG: Got client: {client}")
            logger.info("DEBUG: About to call agent_react with streaming=False")
            response_text = await client.agent_react(
                variables=variables,
                streaming=False
            )
            logger.info(f"DEBUG: agent_react returned, got response")
            logger.debug(f"Response text:\n{response_text}")
            logger.info(f"response: {response_text}")
            # Parse the text response
            try:
                result = self.parse_react_response(response_text)
                logger.info(f"Parsed result: {result}")
                return result
            except ValueError as e:
                logger.error(f"Failed to parse response: {e}")
                # Try to provide a helpful error message
                logger.error(f"Response was: {response_text}")
                raise RuntimeError(f"Failed to parse agent response: {e}")
    async def react(self, question, history, think, observe, context, streaming=False, answer=None):
        logger.info(f"question: {question}")
@ -245,17 +334,27 @@ class AgentManager:
            question = question,
            history = history,
            context = context,
            streaming = streaming,
            think = think,
            observe = observe,
            answer = answer,
        )
        logger.info(f"act: {act}")
        if isinstance(act, Final):
-            await think(act.thought)
+            # In non-streaming mode, send complete thought
            # In streaming mode, thoughts were already sent as chunks
            if not streaming:
                await think(act.thought, is_final=True)
            return act
        else:
-            await think(act.thought)
+            # In non-streaming mode, send complete thought
            # In streaming mode, thoughts were already sent as chunks
            if not streaming:
                await think(act.thought, is_final=True)
            logger.debug(f"ACTION: {act.name}")
@ -281,7 +380,7 @@ class AgentManager:
            logger.info(f"resp: {resp}")
-            await observe(resp)
+            await observe(resp, is_final=True)
            act.observation = resp
--- a/trustgraph-flow/trustgraph/agent/react/service.py
+++ b/trustgraph-flow/trustgraph/agent/react/service.py
@ -191,6 +191,9 @@ class Processor(AgentService):
        try:
            # Check if streaming is enabled
            streaming = getattr(request, 'streaming', False)
            if request.history:
                history = [
                    Action(
@ -211,29 +214,87 @@ class Processor(AgentService):
            logger.debug(f"History: {history}")
-            async def think(x):
+            async def think(x, is_final=False):
-                logger.debug(f"Think: {x}")
+                logger.debug(f"Think: {x} (is_final={is_final})")
-                r = AgentResponse(
+                if streaming:
-                    answer=None,
+                    # Streaming format
-                    error=None,
+                    r = AgentResponse(
-                    thought=x,
+                        chunk_type="thought",
-                    observation=None,
+                        content=x,
-                )
+                        end_of_message=is_final,
                        end_of_dialog=False,
                        # Legacy fields for backward compatibility
                        answer=None,
                        error=None,
                        thought=x,
                        observation=None,
                    )
                else:
                    # Legacy format
                    r = AgentResponse(
                        answer=None,
                        error=None,
                        thought=x,
                        observation=None,
                    )
                await respond(r)
-            async def observe(x):
+            async def observe(x, is_final=False):
-                logger.debug(f"Observe: {x}")
+                logger.debug(f"Observe: {x} (is_final={is_final})")
-                r = AgentResponse(
+                if streaming:
-                    answer=None,
+                    # Streaming format
-                    error=None,
+                    r = AgentResponse(
-                    thought=None,
+                        chunk_type="observation",
-                    observation=x,
+                        content=x,
-                )
+                        end_of_message=is_final,
                        end_of_dialog=False,
                        # Legacy fields for backward compatibility
                        answer=None,
                        error=None,
                        thought=None,
                        observation=x,
                    )
                else:
                    # Legacy format
                    r = AgentResponse(
                        answer=None,
                        error=None,
                        thought=None,
                        observation=x,
                    )
                await respond(r)
            async def answer(x):
                logger.debug(f"Answer: {x}")
                if streaming:
                    # Streaming format
                    r = AgentResponse(
                        chunk_type="answer",
                        content=x,
                        end_of_message=False,  # More chunks may follow
                        end_of_dialog=False,
                        # Legacy fields for backward compatibility
                        answer=None,
                        error=None,
                        thought=None,
                        observation=None,
                    )
                else:
                    # Legacy format - shouldn't be called in non-streaming mode
                    r = AgentResponse(
                        answer=x,
                        error=None,
                        thought=None,
                        observation=None,
                    )
                await respond(r)
@ -273,7 +334,9 @@ class Processor(AgentService):
                history = history,
                think = think,
                observe = observe,
                answer = answer,
                context = UserAwareContext(flow, request.user),
                streaming = streaming,
            )
            logger.debug(f"Action: {act}")
@ -287,11 +350,26 @@ class Processor(AgentService):
                else:
                    f = json.dumps(act.final)
-                r = AgentResponse(
+                if streaming:
-                    answer=act.final,
+                    # Streaming format - send end-of-dialog marker
-                    error=None,
+                    # Answer chunks were already sent via answer() callback during parsing
-                    thought=None,
+                    r = AgentResponse(
-                )
+                        chunk_type="answer",
                        content="",  # Empty content, just marking end of dialog
                        end_of_message=True,
                        end_of_dialog=True,
                        # Legacy fields set to None - answer already sent via streaming chunks
                        answer=None,
                        error=None,
                        thought=None,
                    )
                else:
                    # Legacy format - send complete answer
                    r = AgentResponse(
                        answer=act.final,
                        error=None,
                        thought=None,
                    )
                await respond(r)
@ -321,7 +399,9 @@ class Processor(AgentService):
                        observation=h.observation
                    )
                    for h in history
-                ]
+                ],
                user=request.user,
                streaming=streaming,
            )
            await next(r)
@ -336,14 +416,32 @@ class Processor(AgentService):
            logger.debug("Send error response...")
-            r = AgentResponse(
+            error_obj = Error(
-                error=Error(
+                type = "agent-error",
-                    type = "agent-error",
+                message = str(e),
                    message = str(e),
                ),
                response=None,
            )
            # Check if streaming was enabled (may not be set if error occurred early)
            streaming = getattr(request, 'streaming', False) if 'request' in locals() else False
            if streaming:
                # Streaming format
                r = AgentResponse(
                    chunk_type="error",
                    content=str(e),
                    end_of_message=True,
                    end_of_dialog=True,
                    # Legacy fields for backward compatibility
                    error=error_obj,
                    response=None,
                )
            else:
                # Legacy format
                r = AgentResponse(
                    error=error_obj,
                    response=None,
                )
            await respond(r)
    @staticmethod
--- a/trustgraph-flow/trustgraph/agent/react/streaming_parser.py
+++ b/trustgraph-flow/trustgraph/agent/react/streaming_parser.py
@ -0,0 +1,352 @@
 """
 Streaming parser for ReAct responses.
 This parser handles text chunks from LLM streaming responses and parses them
 into ReAct format (Thought/Action/Args or Thought/Final Answer). It maintains
 state across chunk boundaries to handle cases where delimiters or JSON are split.
 Key challenges:
 - Delimiters may be split across chunks: "Tho" + "ught:" or "Final An" + "swer:"
 - JSON arguments may be split: '{"loc' + 'ation": "NYC"}'
 - Need to emit thought/answer chunks as they arrive for streaming
 """
 import json
 import logging
 import re
 from enum import Enum
 from typing import Optional, Callable, Any
 from . types import Action, Final
 logger = logging.getLogger(__name__)
 class ParserState(Enum):
    """States for the streaming ReAct parser state machine"""
    INITIAL = "initial"              # Waiting for first content
    THOUGHT = "thought"              # Accumulating thought content
    ACTION = "action"                # Found "Action:", collecting action name
    ARGS = "args"                    # Found "Args:", collecting JSON arguments
    FINAL_ANSWER = "final_answer"    # Found "Final Answer:", collecting answer
    COMPLETE = "complete"            # Parsing complete, object ready
 class StreamingReActParser:
    """
    Stateful parser for streaming ReAct responses.
    Expected format:
        Thought: [reasoning about what to do next]
        Action: [tool_name]
        Args: {
            "param": "value"
        }
    OR
        Thought: [reasoning about the final answer]
        Final Answer: [the answer]
    Usage:
        parser = StreamingReActParser(
            on_thought_chunk=lambda chunk: print(f"Thought: {chunk}"),
            on_answer_chunk=lambda chunk: print(f"Answer: {chunk}"),
        )
        for chunk in llm_stream:
            parser.feed(chunk)
            if parser.is_complete():
                result = parser.get_result()
                break
    """
    # Delimiters we're looking for
    THOUGHT_DELIMITER = "Thought:"
    ACTION_DELIMITER = "Action:"
    ARGS_DELIMITER = "Args:"
    FINAL_ANSWER_DELIMITER = "Final Answer:"
    # Maximum buffer size for delimiter detection (longest delimiter + safety margin)
    MAX_DELIMITER_BUFFER = 20
    def __init__(
        self,
        on_thought_chunk: Optional[Callable[[str], Any]] = None,
        on_answer_chunk: Optional[Callable[[str], Any]] = None,
    ):
        """
        Initialize streaming parser.
        Args:
            on_thought_chunk: Callback for thought text chunks as they arrive
            on_answer_chunk: Callback for final answer text chunks as they arrive
        """
        self.on_thought_chunk = on_thought_chunk
        self.on_answer_chunk = on_answer_chunk
        # Parser state
        self.state = ParserState.INITIAL
        # Buffers for accumulating content
        self.line_buffer = ""           # For detecting delimiters across chunk boundaries
        self.thought_buffer = ""        # Accumulated thought text
        self.action_buffer = ""         # Action name
        self.args_buffer = ""           # JSON arguments text
        self.answer_buffer = ""         # Final answer text
        # JSON parsing state for Args
        self.brace_count = 0
        self.args_started = False
        # Result object (Action or Final)
        self.result = None
    def feed(self, chunk: str) -> None:
        """
        Feed a text chunk to the parser.
        Args:
            chunk: Text chunk from LLM stream
        """
        if self.state == ParserState.COMPLETE:
            return  # Already complete, ignore further chunks
        # Add chunk to line buffer for delimiter detection
        self.line_buffer += chunk
        # Remove markdown code blocks if present
        self.line_buffer = re.sub(r'^```[^\n]*\n', '', self.line_buffer)
        self.line_buffer = re.sub(r'\n```$', '', self.line_buffer)
        # Process based on current state
        # Track previous state to detect if we're making progress
        while self.line_buffer and self.state != ParserState.COMPLETE:
            prev_buffer_len = len(self.line_buffer)
            prev_state = self.state
            if self.state == ParserState.INITIAL:
                self._process_initial()
            elif self.state == ParserState.THOUGHT:
                self._process_thought()
            elif self.state == ParserState.ACTION:
                self._process_action()
            elif self.state == ParserState.ARGS:
                self._process_args()
            elif self.state == ParserState.FINAL_ANSWER:
                self._process_final_answer()
            # If no progress was made (buffer unchanged AND state unchanged), break
            # to avoid infinite loop. We'll process more when the next chunk arrives.
            if len(self.line_buffer) == prev_buffer_len and self.state == prev_state:
                break
    def _process_initial(self) -> None:
        """Process INITIAL state - looking for 'Thought:' delimiter"""
        idx = self.line_buffer.find(self.THOUGHT_DELIMITER)
        if idx >= 0:
            # Found thought delimiter
            # Discard any content before it and strip leading whitespace after delimiter
            self.line_buffer = self.line_buffer[idx + len(self.THOUGHT_DELIMITER):].lstrip()
            self.state = ParserState.THOUGHT
        elif len(self.line_buffer) >= self.MAX_DELIMITER_BUFFER:
            # Buffer getting too large, probably junk before thought
            # Keep only the tail that might contain partial delimiter
            self.line_buffer = self.line_buffer[-self.MAX_DELIMITER_BUFFER:]
    def _process_thought(self) -> None:
        """Process THOUGHT state - accumulating thought content"""
        # Check for Action or Final Answer delimiter
        action_idx = self.line_buffer.find(self.ACTION_DELIMITER)
        final_idx = self.line_buffer.find(self.FINAL_ANSWER_DELIMITER)
        # Find which delimiter comes first (if any)
        next_delimiter_idx = -1
        next_state = None
        if action_idx >= 0 and (final_idx < 0 or action_idx < final_idx):
            next_delimiter_idx = action_idx
            next_state = ParserState.ACTION
            delimiter_len = len(self.ACTION_DELIMITER)
        elif final_idx >= 0:
            next_delimiter_idx = final_idx
            next_state = ParserState.FINAL_ANSWER
            delimiter_len = len(self.FINAL_ANSWER_DELIMITER)
        if next_delimiter_idx >= 0:
            # Found next delimiter
            thought_chunk = self.line_buffer[:next_delimiter_idx].strip()
            if thought_chunk:
                self.thought_buffer += thought_chunk
                if self.on_thought_chunk:
                    self.on_thought_chunk(thought_chunk)
            self.line_buffer = self.line_buffer[next_delimiter_idx + delimiter_len:].lstrip()
            self.state = next_state
        else:
            # No delimiter found yet
            # Keep tail in buffer (might contain partial delimiter)
            # Emit the rest as thought chunk
            if len(self.line_buffer) > self.MAX_DELIMITER_BUFFER:
                emittable = self.line_buffer[:-self.MAX_DELIMITER_BUFFER]
                self.thought_buffer += emittable
                if self.on_thought_chunk:
                    self.on_thought_chunk(emittable)
                self.line_buffer = self.line_buffer[-self.MAX_DELIMITER_BUFFER:]
    def _process_action(self) -> None:
        """Process ACTION state - collecting action name"""
        # Action name is on one line (or at least until newline or Args:)
        newline_idx = self.line_buffer.find('\n')
        args_idx = self.line_buffer.find(self.ARGS_DELIMITER)
        # Find which comes first
        if args_idx >= 0 and (newline_idx < 0 or args_idx < newline_idx):
            # Args delimiter found first
            # Only set action_buffer if not already set (to avoid overwriting with empty string)
            if not self.action_buffer:
                self.action_buffer = self.line_buffer[:args_idx].strip().strip('"')
            self.line_buffer = self.line_buffer[args_idx + len(self.ARGS_DELIMITER):].lstrip()
            self.state = ParserState.ARGS
        elif newline_idx >= 0:
            # Newline found, action name complete
            # Only set action_buffer if not already set
            if not self.action_buffer:
                self.action_buffer = self.line_buffer[:newline_idx].strip().strip('"')
            self.line_buffer = self.line_buffer[newline_idx + 1:]
            # Stay in ACTION state or move to ARGS if we find delimiter
            # Actually, check if next line has Args:
            if self.line_buffer.lstrip().startswith(self.ARGS_DELIMITER):
                args_start = self.line_buffer.find(self.ARGS_DELIMITER)
                self.line_buffer = self.line_buffer[args_start + len(self.ARGS_DELIMITER):].lstrip()
                self.state = ParserState.ARGS
        else:
            # Not enough content yet, keep buffering
            # But if buffer is getting large, action name is probably complete
            if len(self.line_buffer) > 100:
                self.action_buffer = self.line_buffer.strip().strip('"')
                self.line_buffer = ""
                # Assume Args comes next, but we need more content
                self.state = ParserState.ARGS
    def _process_args(self) -> None:
        """Process ARGS state - collecting JSON arguments"""
        # Process character by character to track brace matching
        i = 0
        while i < len(self.line_buffer):
            char = self.line_buffer[i]
            self.args_buffer += char
            if char == '{':
                self.brace_count += 1
                self.args_started = True
            elif char == '}':
                self.brace_count -= 1
            # Check if JSON is complete
            if self.args_started and self.brace_count == 0:
                # JSON complete, try to parse
                try:
                    args_dict = json.loads(self.args_buffer.strip())
                    # Success! Create Action result
                    self.result = Action(
                        thought=self.thought_buffer.strip(),
                        name=self.action_buffer,
                        arguments=args_dict,
                        observation=""
                    )
                    self.state = ParserState.COMPLETE
                    self.line_buffer = ""  # Clear buffer
                    return
                except json.JSONDecodeError as e:
                    logger.error(f"Failed to parse JSON args: {self.args_buffer}")
                    raise ValueError(f"Invalid JSON in Args: {e}")
            i += 1
        # Consumed entire buffer, clear it and wait for more chunks
        self.line_buffer = ""
    def _process_final_answer(self) -> None:
        """Process FINAL_ANSWER state - collecting final answer"""
        # For final answer, we consume everything until we decide we're done
        # In streaming mode, we can't know when answer is complete until stream ends
        # So we emit chunks and accumulate
        # Check if this might be JSON
        is_json = self.answer_buffer.strip().startswith('{') or \
                  self.line_buffer.strip().startswith('{')
        if is_json:
            # Handle JSON final answer
            self.answer_buffer += self.line_buffer
            # Count braces to detect completion
            brace_count = self.answer_buffer.count('{') - self.answer_buffer.count('}')
            if brace_count == 0 and '{' in self.answer_buffer:
                # JSON might be complete
                # Note: We can't be 100% sure without trying to parse
                # But in streaming mode, we'll finish when stream ends
                pass
            # Emit chunk
            if self.on_answer_chunk:
                self.on_answer_chunk(self.line_buffer)
            self.line_buffer = ""
        else:
            # Regular text answer - emit everything
            if self.line_buffer:
                self.answer_buffer += self.line_buffer
                if self.on_answer_chunk:
                    self.on_answer_chunk(self.line_buffer)
                self.line_buffer = ""
    def finalize(self) -> None:
        """
        Call this when the stream is complete to finalize parsing.
        This handles any remaining buffered content.
        """
        if self.state == ParserState.COMPLETE:
            return
        # Flush any remaining thought chunks
        if self.state == ParserState.THOUGHT and self.line_buffer:
            self.thought_buffer += self.line_buffer
            if self.on_thought_chunk:
                self.on_thought_chunk(self.line_buffer)
            self.line_buffer = ""
        # Finalize final answer
        if self.state == ParserState.FINAL_ANSWER:
            # Flush any remaining answer content
            if self.line_buffer:
                self.answer_buffer += self.line_buffer
                if self.on_answer_chunk:
                    self.on_answer_chunk(self.line_buffer)
                self.line_buffer = ""
            # Create Final result
            self.result = Final(
                thought=self.thought_buffer.strip(),
                final=self.answer_buffer.strip()
            )
            self.state = ParserState.COMPLETE
        # If we're in other states, something went wrong
        if self.state not in [ParserState.COMPLETE, ParserState.FINAL_ANSWER]:
            if self.thought_buffer:
                raise ValueError(
                    f"Stream ended in {self.state.value} state with incomplete parsing. "
                    f"Thought: {self.thought_buffer[:100]}..."
                )
            else:
                raise ValueError(f"Stream ended in {self.state.value} state with no content")
    def is_complete(self) -> bool:
        """Check if parsing is complete"""
        return self.state == ParserState.COMPLETE
    def get_result(self) -> Optional[Action | Final]:
        """Get the parsed result (Action or Final)"""
        return self.result
--- a/trustgraph-flow/trustgraph/librarian/blob_store.py
+++ b/trustgraph-flow/trustgraph/librarian/blob_store.py
@ -19,7 +19,7 @@ class BlobStore:
        self.minio = Minio(
-            minio_host,
+            endpoint = minio_host,
            access_key = minio_access_key,
            secret_key = minio_secret_key,
            secure = False,
@ -34,9 +34,9 @@ class BlobStore:
    def ensure_bucket(self):
        # Make the bucket if it doesn't exist.
-        found = self.minio.bucket_exists(self.bucket_name)
+        found = self.minio.bucket_exists(bucket_name=self.bucket_name)
        if not found:
-            self.minio.make_bucket(self.bucket_name)
+            self.minio.make_bucket(bucket_name=self.bucket_name)
            logger.info(f"Created bucket {self.bucket_name}")
        else:
            logger.debug(f"Bucket {self.bucket_name} already exists")
--- a/trustgraph-flow/trustgraph/model/text_completion/azure/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/azure/llm.py
@ -11,7 +11,7 @@ import os
 import logging
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 # Module logger
 logger = logging.getLogger(__name__)
@ -55,7 +55,7 @@ class Processor(LlmService):
        self.max_output = max_output
        self.default_model = model
-    def build_prompt(self, system, content, temperature=None):
+    def build_prompt(self, system, content, temperature=None, stream=False):
        # Use provided temperature or fall back to default
        effective_temperature = temperature if temperature is not None else self.temperature
@ -73,6 +73,9 @@ class Processor(LlmService):
            "top_p": 1
        }
        if stream:
            data["stream"] = True
        body = json.dumps(data)
        return body
@ -157,6 +160,84 @@ class Processor(LlmService):
        logger.debug("Azure LLM processing complete")
    def supports_streaming(self):
        """Azure serverless endpoints support streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Azure serverless endpoint"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        try:
            body = self.build_prompt(system, prompt, effective_temperature, stream=True)
            url = self.endpoint
            api_key = self.token
            headers = {
                'Content-Type': 'application/json',
                'Authorization': f'Bearer {api_key}'
            }
            response = requests.post(url, data=body, headers=headers, stream=True)
            if response.status_code == 429:
                raise TooManyRequests()
            if response.status_code != 200:
                raise RuntimeError("LLM failure")
            # Parse SSE stream
            for line in response.iter_lines():
                if line:
                    line = line.decode('utf-8').strip()
                    if line.startswith('data: '):
                        data = line[6:]  # Remove 'data: ' prefix
                        if data == '[DONE]':
                            break
                        try:
                            chunk_data = json.loads(data)
                            if 'choices' in chunk_data and len(chunk_data['choices']) > 0:
                                delta = chunk_data['choices'][0].get('delta', {})
                                content = delta.get('content')
                                if content:
                                    yield LlmChunk(
                                        text=content,
                                        in_token=None,
                                        out_token=None,
                                        model=model_name,
                                        is_final=False
                                    )
                        except json.JSONDecodeError:
                            logger.warning(f"Failed to parse chunk: {data}")
                            continue
            # Send final chunk
            yield LlmChunk(
                text="",
                in_token=None,
                out_token=None,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except TooManyRequests:
            logger.warning("Rate limit exceeded during streaming")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"Azure streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/azure_openai/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/azure_openai/llm.py
@ -14,7 +14,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -125,6 +125,75 @@ class Processor(LlmService):
        logger.debug("Azure OpenAI LLM processing complete")
    def supports_streaming(self):
        """Azure OpenAI supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """
        Stream content generation from Azure OpenAI.
        Yields LlmChunk objects with is_final=True on the last chunk.
        """
        # Use provided model or fall back to default
        model_name = model or self.default_model
        # Use provided temperature or fall back to default
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        prompt = system + "\n\n" + prompt
        try:
            response = self.openai.chat.completions.create(
                model=model_name,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": prompt
                            }
                        ]
                    }
                ],
                temperature=effective_temperature,
                max_tokens=self.max_output,
                top_p=1,
                stream=True  # Enable streaming
            )
            # Stream chunks
            for chunk in response:
                if chunk.choices and chunk.choices[0].delta.content:
                    yield LlmChunk(
                        text=chunk.choices[0].delta.content,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
            # Send final chunk
            yield LlmChunk(
                text="",
                in_token=None,
                out_token=None,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except RateLimitError:
            logger.warning("Rate limit exceeded during streaming")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"Azure OpenAI streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/claude/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/claude/llm.py
@ -9,7 +9,7 @@ import os
 import logging
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 # Module logger
 logger = logging.getLogger(__name__)
@ -106,6 +106,65 @@ class Processor(LlmService):
            logger.error(f"Claude LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """Claude/Anthropic supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Claude"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        try:
            with self.claude.messages.stream(
                model=model_name,
                max_tokens=self.max_output,
                temperature=effective_temperature,
                system=system,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": prompt
                            }
                        ]
                    }
                ]
            ) as stream:
                for text in stream.text_stream:
                    yield LlmChunk(
                        text=text,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
                # Get final message for token counts
                final_message = stream.get_final_message()
                yield LlmChunk(
                    text="",
                    in_token=final_message.usage.input_tokens,
                    out_token=final_message.usage.output_tokens,
                    model=model_name,
                    is_final=True
                )
            logger.debug("Streaming complete")
        except anthropic.RateLimitError:
            logger.warning("Rate limit exceeded during streaming")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"Claude streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/cohere/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/cohere/llm.py
@ -13,7 +13,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -98,6 +98,68 @@ class Processor(LlmService):
            logger.error(f"Cohere LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """Cohere supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Cohere"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        try:
            stream = self.cohere.chat_stream(
                model=model_name,
                message=prompt,
                preamble=system,
                temperature=effective_temperature,
                chat_history=[],
                prompt_truncation='auto',
                connectors=[]
            )
            total_input_tokens = 0
            total_output_tokens = 0
            for event in stream:
                if event.event_type == "text-generation":
                    if hasattr(event, 'text') and event.text:
                        yield LlmChunk(
                            text=event.text,
                            in_token=None,
                            out_token=None,
                            model=model_name,
                            is_final=False
                        )
                elif event.event_type == "stream-end":
                    # Extract token counts from final event
                    if hasattr(event, 'response') and hasattr(event.response, 'meta'):
                        if hasattr(event.response.meta, 'billed_units'):
                            total_input_tokens = int(event.response.meta.billed_units.input_tokens)
                            total_output_tokens = int(event.response.meta.billed_units.output_tokens)
            # Send final chunk with token counts
            yield LlmChunk(
                text="",
                in_token=total_input_tokens,
                out_token=total_output_tokens,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except cohere.TooManyRequestsError:
            logger.warning("Rate limit exceeded during streaming")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"Cohere streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/googleaistudio/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/googleaistudio/llm.py
@ -23,7 +23,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -159,6 +159,67 @@ class Processor(LlmService):
            logger.error(f"GoogleAIStudio LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """Google AI Studio supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Google AI Studio"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        generation_config = self._get_or_create_config(model_name, effective_temperature)
        generation_config.system_instruction = system
        try:
            response = self.client.models.generate_content_stream(
                model=model_name,
                config=generation_config,
                contents=prompt,
            )
            total_input_tokens = 0
            total_output_tokens = 0
            for chunk in response:
                if hasattr(chunk, 'text') and chunk.text:
                    yield LlmChunk(
                        text=chunk.text,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
                # Accumulate token counts if available
                if hasattr(chunk, 'usage_metadata'):
                    if hasattr(chunk.usage_metadata, 'prompt_token_count'):
                        total_input_tokens = int(chunk.usage_metadata.prompt_token_count)
                    if hasattr(chunk.usage_metadata, 'candidates_token_count'):
                        total_output_tokens = int(chunk.usage_metadata.candidates_token_count)
            # Send final chunk with token counts
            yield LlmChunk(
                text="",
                in_token=total_input_tokens,
                out_token=total_output_tokens,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except ResourceExhausted:
            logger.warning("Rate limit exceeded during streaming")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"GoogleAIStudio streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/llamafile/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/llamafile/llm.py
@ -12,7 +12,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -102,6 +102,57 @@ class Processor(LlmService):
            logger.error(f"Llamafile LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """LlamaFile supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from LlamaFile"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        prompt = system + "\n\n" + prompt
        try:
            response = self.openai.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=effective_temperature,
                max_tokens=self.max_output,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                response_format={"type": "text"},
                stream=True
            )
            for chunk in response:
                if chunk.choices and chunk.choices[0].delta.content:
                    yield LlmChunk(
                        text=chunk.choices[0].delta.content,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
            yield LlmChunk(
                text="",
                in_token=None,
                out_token=None,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except Exception as e:
            logger.error(f"LlamaFile streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/lmstudio/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/lmstudio/llm.py
@ -12,7 +12,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -106,6 +106,57 @@ class Processor(LlmService):
            logger.error(f"LMStudio LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """LM Studio supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from LM Studio"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        prompt = system + "\n\n" + prompt
        try:
            response = self.openai.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=effective_temperature,
                max_tokens=self.max_output,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                response_format={"type": "text"},
                stream=True
            )
            for chunk in response:
                if chunk.choices and chunk.choices[0].delta.content:
                    yield LlmChunk(
                        text=chunk.choices[0].delta.content,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
            yield LlmChunk(
                text="",
                in_token=None,
                out_token=None,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except Exception as e:
            logger.error(f"LMStudio streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/mistral/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/mistral/llm.py
@ -12,7 +12,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -120,6 +120,67 @@ class Processor(LlmService):
            logger.error(f"Mistral LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """Mistral supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Mistral"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        prompt = system + "\n\n" + prompt
        try:
            stream = self.mistral.chat.stream(
                model=model_name,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": prompt
                            }
                        ]
                    }
                ],
                temperature=effective_temperature,
                max_tokens=self.max_output,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                response_format={"type": "text"}
            )
            for chunk in stream:
                if chunk.data.choices and chunk.data.choices[0].delta.content:
                    yield LlmChunk(
                        text=chunk.data.choices[0].delta.content,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
            # Send final chunk
            yield LlmChunk(
                text="",
                in_token=None,
                out_token=None,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except Exception as e:
            logger.error(f"Mistral streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/ollama/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/ollama/llm.py
@ -12,7 +12,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -79,6 +79,62 @@ class Processor(LlmService):
            logger.error(f"Ollama LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """Ollama supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from Ollama"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        prompt = system + "\n\n" + prompt
        try:
            stream = self.llm.generate(
                model_name,
                prompt,
                options={'temperature': effective_temperature},
                stream=True
            )
            total_input_tokens = 0
            total_output_tokens = 0
            for chunk in stream:
                if 'response' in chunk and chunk['response']:
                    yield LlmChunk(
                        text=chunk['response'],
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
                # Accumulate token counts if available
                if 'prompt_eval_count' in chunk:
                    total_input_tokens = int(chunk['prompt_eval_count'])
                if 'eval_count' in chunk:
                    total_output_tokens = int(chunk['eval_count'])
            # Send final chunk with token counts
            yield LlmChunk(
                text="",
                in_token=total_input_tokens,
                out_token=total_output_tokens,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except Exception as e:
            logger.error(f"Ollama streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/openai/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/openai/llm.py
@ -9,7 +9,7 @@ import os
 import logging
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 # Module logger
 logger = logging.getLogger(__name__)
@ -118,6 +118,75 @@ class Processor(LlmService):
            logger.error(f"OpenAI LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """OpenAI supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """
        Stream content generation from OpenAI.
        Yields LlmChunk objects with is_final=True on the last chunk.
        """
        # Use provided model or fall back to default
        model_name = model or self.default_model
        # Use provided temperature or fall back to default
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        prompt = system + "\n\n" + prompt
        try:
            response = self.openai.chat.completions.create(
                model=model_name,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": prompt
                            }
                        ]
                    }
                ],
                temperature=effective_temperature,
                max_tokens=self.max_output,
                stream=True  # Enable streaming
            )
            # Stream chunks
            for chunk in response:
                if chunk.choices and chunk.choices[0].delta.content:
                    yield LlmChunk(
                        text=chunk.choices[0].delta.content,
                        in_token=None,
                        out_token=None,
                        model=model_name,
                        is_final=False
                    )
            # Note: OpenAI doesn't provide token counts in streaming mode
            # Send final chunk without token counts
            yield LlmChunk(
                text="",
                in_token=None,
                out_token=None,
                model=model_name,
                is_final=True
            )
            logger.debug("Streaming complete")
        except RateLimitError:
            logger.warning("Hit rate limit during streaming")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"OpenAI streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/tgi/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/tgi/llm.py
@ -12,7 +12,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -121,6 +121,100 @@ class Processor(LlmService):
            logger.error(f"TGI LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """TGI supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from TGI"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        headers = {
            "Content-Type": "application/json",
        }
        request = {
            "model": model_name,
            "messages": [
                {
                    "role": "system",
                    "content": system,
                },
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            "max_tokens": self.max_output,
            "temperature": effective_temperature,
            "stream": True,
        }
        try:
            url = f"{self.base_url}/chat/completions"
            async with self.session.post(
                    url,
                    headers=headers,
                    json=request,
            ) as response:
                if response.status != 200:
                    raise RuntimeError("Bad status: " + str(response.status))
                # Parse SSE stream
                async for line in response.content:
                    line = line.decode('utf-8').strip()
                    if not line:
                        continue
                    if line.startswith('data: '):
                        data = line[6:]  # Remove 'data: ' prefix
                        if data == '[DONE]':
                            break
                        try:
                            import json
                            chunk_data = json.loads(data)
                            # Extract text from chunk
                            if 'choices' in chunk_data and len(chunk_data['choices']) > 0:
                                choice = chunk_data['choices'][0]
                                if 'delta' in choice and 'content' in choice['delta']:
                                    content = choice['delta']['content']
                                    if content:
                                        yield LlmChunk(
                                            text=content,
                                            in_token=None,
                                            out_token=None,
                                            model=model_name,
                                            is_final=False
                                        )
                        except json.JSONDecodeError:
                            logger.warning(f"Failed to parse chunk: {data}")
                            continue
                # Send final chunk
                yield LlmChunk(
                    text="",
                    in_token=None,
                    out_token=None,
                    model=model_name,
                    is_final=True
                )
                logger.debug("Streaming complete")
        except Exception as e:
            logger.error(f"TGI streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/model/text_completion/vllm/llm.py
+++ b/trustgraph-flow/trustgraph/model/text_completion/vllm/llm.py
@ -12,7 +12,7 @@ import logging
 logger = logging.getLogger(__name__)
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 default_ident = "text-completion"
@ -113,6 +113,89 @@ class Processor(LlmService):
            logger.error(f"vLLM LLM exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """vLLM supports streaming"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """Stream content generation from vLLM"""
        model_name = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        headers = {
            "Content-Type": "application/json",
        }
        request = {
            "model": model_name,
            "prompt": system + "\n\n" + prompt,
            "max_tokens": self.max_output,
            "temperature": effective_temperature,
            "stream": True,
        }
        try:
            url = f"{self.base_url}/completions"
            async with self.session.post(
                    url,
                    headers=headers,
                    json=request,
            ) as response:
                if response.status != 200:
                    raise RuntimeError("Bad status: " + str(response.status))
                # Parse SSE stream
                async for line in response.content:
                    line = line.decode('utf-8').strip()
                    if not line:
                        continue
                    if line.startswith('data: '):
                        data = line[6:]  # Remove 'data: ' prefix
                        if data == '[DONE]':
                            break
                        try:
                            import json
                            chunk_data = json.loads(data)
                            # Extract text from chunk
                            if 'choices' in chunk_data and len(chunk_data['choices']) > 0:
                                choice = chunk_data['choices'][0]
                                if 'text' in choice and choice['text']:
                                    yield LlmChunk(
                                        text=choice['text'],
                                        in_token=None,
                                        out_token=None,
                                        model=model_name,
                                        is_final=False
                                    )
                        except json.JSONDecodeError:
                            logger.warning(f"Failed to parse chunk: {data}")
                            continue
                # Send final chunk
                yield LlmChunk(
                    text="",
                    in_token=None,
                    out_token=None,
                    model=model_name,
                    is_final=True
                )
                logger.debug("Streaming complete")
        except Exception as e:
            logger.error(f"vLLM streaming exception ({type(e).__name__}): {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/prompt/template/service.py
+++ b/trustgraph-flow/trustgraph/prompt/template/service.py
@ -101,6 +101,9 @@ class Processor(FlowProcessor):
        kind = v.id
        # Check if streaming is requested
        streaming = getattr(v, 'streaming', False)
        try:
            logger.debug(f"Prompt terms: {v.terms}")
@ -109,16 +112,68 @@ class Processor(FlowProcessor):
                k: json.loads(v)
                for k, v in v.terms.items()
            }
            logger.debug(f"Handling prompt kind {kind}...")
            logger.debug(f"Handling prompt kind {kind}... (streaming={streaming})")
            # If streaming, we need to handle it differently
            if streaming:
                # For streaming, we need to intercept LLM responses
                # and forward them as they arrive
                async def llm_streaming(system, prompt):
                    logger.debug(f"System prompt: {system}")
                    logger.debug(f"User prompt: {prompt}")
                    # Use the text completion client with recipient handler
                    client = flow("text-completion-request")
                    async def forward_chunks(resp):
                        if resp.error:
                            raise RuntimeError(resp.error.message)
                        is_final = getattr(resp, 'end_of_stream', False)
                        # Always send a message if there's content OR if it's the final message
                        if resp.response or is_final:
                            # Forward each chunk immediately
                            r = PromptResponse(
                                text=resp.response if resp.response else "",
                                object=None,
                                error=None,
                                end_of_stream=is_final,
                            )
                            await flow("response").send(r, properties={"id": id})
                        # Return True when end_of_stream
                        return is_final
                    await client.request(
                        TextCompletionRequest(
                            system=system, prompt=prompt, streaming=True
                        ),
                        recipient=forward_chunks,
                        timeout=600
                    )
                    # Return empty string since we already sent all chunks
                    return ""
                try:
                    await self.manager.invoke(kind, input, llm_streaming)
                except Exception as e:
                    logger.error(f"Prompt streaming exception: {e}", exc_info=True)
                    raise e
                return
            # Non-streaming path (original behavior)
            async def llm(system, prompt):
                logger.debug(f"System prompt: {system}")
                logger.debug(f"User prompt: {prompt}")
                resp = await flow("text-completion-request").text_completion(
-                    system = system, prompt = prompt,
+                    system = system, prompt = prompt, streaming = False,
                )
                try:
@ -143,6 +198,7 @@ class Processor(FlowProcessor):
                    text=resp,
                    object=None,
                    error=None,
                    end_of_stream=True,
                )
                await flow("response").send(r, properties={"id": id})
@ -158,6 +214,7 @@ class Processor(FlowProcessor):
                    text=None,
                    object=json.dumps(resp),
                    error=None,
                    end_of_stream=True,
                )
                await flow("response").send(r, properties={"id": id})
@ -175,27 +232,13 @@ class Processor(FlowProcessor):
                    type = "llm-error",
                    message = str(e),
                ),
-                response=None,
+                text=None,
                object=None,
                end_of_stream=True,
            )
            await flow("response").send(r, properties={"id": id})
        except Exception as e:
            logger.error(f"Prompt service exception: {e}", exc_info=True)
            logger.debug("Sending error response...")
            r = PromptResponse(
                error=Error(
                    type = "llm-error",
                    message = str(e),
                ),
                response=None,
            )
            await self.send(r, properties={"id": id})
    @staticmethod
    def add_args(parser):
--- a/trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py
+++ b/trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py
@ -68,7 +68,7 @@ class DocumentRag:
    async def query(
            self, query, user="trustgraph", collection="default",
-            doc_limit=20,
+            doc_limit=20, streaming=False, chunk_callback=None,
    ):
        if self.verbose:
@ -86,10 +86,18 @@ class DocumentRag:
            logger.debug(f"Documents: {docs}")
            logger.debug(f"Query: {query}")
-        resp = await self.prompt_client.document_prompt(
+        if streaming and chunk_callback:
-            query = query,
+            resp = await self.prompt_client.document_prompt(
-            documents = docs
+                query=query,
-        )
+                documents=docs,
                streaming=True,
                chunk_callback=chunk_callback
            )
        else:
            resp = await self.prompt_client.document_prompt(
                query=query,
                documents=docs
            )
        if self.verbose:
            logger.debug("Query processing complete")
--- a/trustgraph-flow/trustgraph/retrieval/document_rag/rag.py
+++ b/trustgraph-flow/trustgraph/retrieval/document_rag/rag.py
@ -92,20 +92,56 @@ class Processor(FlowProcessor):
            else:
                doc_limit = self.doc_limit
-            response = await self.rag.query(
+            # Check if streaming is requested
-                v.query, 
+            if v.streaming:
-                user=v.user, 
+                # Define async callback for streaming chunks
-                collection=v.collection, 
+                async def send_chunk(chunk):
-                doc_limit=doc_limit
+                    await flow("response").send(
-            )
+                        DocumentRagResponse(
                            chunk=chunk,
                            end_of_stream=False,
                            response=None,
                            error=None
                        ),
                        properties={"id": id}
                    )
-            await flow("response").send(
+                # Query with streaming enabled
-                DocumentRagResponse(
+                full_response = await self.rag.query(
-                    response = response,
+                    v.query,
-                    error = None
+                    user=v.user,
-                ),
+                    collection=v.collection,
-                properties = {"id": id}
+                    doc_limit=doc_limit,
-            )
+                    streaming=True,
                    chunk_callback=send_chunk,
                )
                # Send final message with complete response
                await flow("response").send(
                    DocumentRagResponse(
                        chunk=None,
                        end_of_stream=True,
                        response=full_response,
                        error=None
                    ),
                    properties={"id": id}
                )
            else:
                # Non-streaming path (existing behavior)
                response = await self.rag.query(
                    v.query,
                    user=v.user,
                    collection=v.collection,
                    doc_limit=doc_limit
                )
                await flow("response").send(
                    DocumentRagResponse(
                        response = response,
                        error = None
                    ),
                    properties = {"id": id}
                )
            logger.info("Request processing complete")
@ -115,14 +151,21 @@ class Processor(FlowProcessor):
            logger.debug("Sending error response...")
-            await flow("response").send(
+            # Send error response with end_of_stream flag if streaming was requested
-                DocumentRagResponse(
+            error_response = DocumentRagResponse(
-                    response = None,
+                response = None,
-                    error = Error(
+                error = Error(
-                        type = "document-rag-error",
+                    type = "document-rag-error",
-                        message = str(e),
+                    message = str(e),
                    ),
                ),
            )
            # If streaming was requested, indicate stream end
            if v.streaming:
                error_response.end_of_stream = True
            await flow("response").send(
                error_response,
                properties = {"id": id}
            )
--- a/trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py
+++ b/trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py
@ -316,7 +316,7 @@ class GraphRag:
    async def query(
            self, query, user = "trustgraph", collection = "default",
            entity_limit = 50, triple_limit = 30, max_subgraph_size = 1000,
-            max_path_length = 2,
+            max_path_length = 2, streaming = False, chunk_callback = None,
    ):
        if self.verbose:
@ -337,7 +337,14 @@ class GraphRag:
            logger.debug(f"Knowledge graph: {kg}")
            logger.debug(f"Query: {query}")
-        resp = await self.prompt_client.kg_prompt(query, kg)
+        if streaming and chunk_callback:
            resp = await self.prompt_client.kg_prompt(
                query, kg,
                streaming=True,
                chunk_callback=chunk_callback
            )
        else:
            resp = await self.prompt_client.kg_prompt(query, kg)
        if self.verbose:
            logger.debug("Query processing complete")
--- a/trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py
+++ b/trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py
@ -135,20 +135,56 @@ class Processor(FlowProcessor):
            else:
                max_path_length = self.default_max_path_length
-            response = await rag.query(
+            # Check if streaming is requested
-                query = v.query, user = v.user, collection = v.collection,
+            if v.streaming:
-                entity_limit = entity_limit, triple_limit = triple_limit,
+                # Define async callback for streaming chunks
-                max_subgraph_size = max_subgraph_size,
+                async def send_chunk(chunk):
-                max_path_length = max_path_length,
+                    await flow("response").send(
-            )
+                        GraphRagResponse(
                            chunk=chunk,
                            end_of_stream=False,
                            response=None,
                            error=None
                        ),
                        properties={"id": id}
                    )
-            await flow("response").send(
+                # Query with streaming enabled
-                GraphRagResponse(
+                full_response = await rag.query(
-                    response = response,
+                    query = v.query, user = v.user, collection = v.collection,
-                    error = None
+                    entity_limit = entity_limit, triple_limit = triple_limit,
-                ),
+                    max_subgraph_size = max_subgraph_size,
-                properties = {"id": id}
+                    max_path_length = max_path_length,
-            )
+                    streaming = True,
                    chunk_callback = send_chunk,
                )
                # Send final message with complete response
                await flow("response").send(
                    GraphRagResponse(
                        chunk=None,
                        end_of_stream=True,
                        response=full_response,
                        error=None
                    ),
                    properties={"id": id}
                )
            else:
                # Non-streaming path (existing behavior)
                response = await rag.query(
                    query = v.query, user = v.user, collection = v.collection,
                    entity_limit = entity_limit, triple_limit = triple_limit,
                    max_subgraph_size = max_subgraph_size,
                    max_path_length = max_path_length,
                )
                await flow("response").send(
                    GraphRagResponse(
                        response = response,
                        error = None
                    ),
                    properties = {"id": id}
                )
            logger.info("Request processing complete")
@ -158,14 +194,21 @@ class Processor(FlowProcessor):
            logger.debug("Sending error response...")
-            await flow("response").send(
+            # Send error response with end_of_stream flag if streaming was requested
-                GraphRagResponse(
+            error_response = GraphRagResponse(
-                    response = None,
+                response = None,
-                    error = Error(
+                error = Error(
-                        type = "graph-rag-error",
+                    type = "graph-rag-error",
-                        message = str(e),
+                    message = str(e),
                    ),
                ),
            )
            # If streaming was requested, indicate stream end
            if v.streaming:
                error_response.end_of_stream = True
            await flow("response").send(
                error_response,
                properties = {"id": id}
            )
--- a/trustgraph-ocr/pyproject.toml
+++ b/trustgraph-ocr/pyproject.toml
@ -10,7 +10,7 @@ description = "TrustGraph provides a means to run a pipeline of flexible AI proc
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
    "pulsar-client",
    "prometheus-client",
    "boto3",
--- a/trustgraph-vertexai/pyproject.toml
+++ b/trustgraph-vertexai/pyproject.toml
@ -10,7 +10,7 @@ description = "TrustGraph provides a means to run a pipeline of flexible AI proc
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
    "pulsar-client",
    "google-cloud-aiplatform",
    "prometheus-client",
--- a/trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py
+++ b/trustgraph-vertexai/trustgraph/model/text_completion/vertexai/llm.py
@ -32,7 +32,7 @@ from vertexai.generative_models import (
 from anthropic import AnthropicVertex, RateLimitError
 from .... exceptions import TooManyRequests
-from .... base import LlmService, LlmResult
+from .... base import LlmService, LlmResult, LlmChunk
 # Module logger
 logger = logging.getLogger(__name__)
@ -239,6 +239,123 @@ class Processor(LlmService):
            logger.error(f"VertexAI LLM exception: {e}", exc_info=True)
            raise e
    def supports_streaming(self):
        """VertexAI supports streaming for both Gemini and Claude models"""
        return True
    async def generate_content_stream(self, system, prompt, model=None, temperature=None):
        """
        Stream content generation from VertexAI (Gemini or Claude).
        Yields LlmChunk objects with is_final=True on the last chunk.
        """
        # Use provided model or fall back to default
        model_name = model or self.default_model
        # Use provided temperature or fall back to default
        effective_temperature = temperature if temperature is not None else self.temperature
        logger.debug(f"Using model (streaming): {model_name}")
        logger.debug(f"Using temperature: {effective_temperature}")
        try:
            if 'claude' in model_name.lower():
                # Claude/Anthropic streaming
                logger.debug(f"Streaming request to Anthropic model '{model_name}'...")
                client = self._get_anthropic_client()
                total_in_tokens = 0
                total_out_tokens = 0
                with client.messages.stream(
                    model=model_name,
                    system=system,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=self.api_params['max_output_tokens'],
                    temperature=effective_temperature,
                    top_p=self.api_params['top_p'],
                    top_k=self.api_params['top_k'],
                ) as stream:
                    # Stream text chunks
                    for text in stream.text_stream:
                        yield LlmChunk(
                            text=text,
                            in_token=None,
                            out_token=None,
                            model=model_name,
                            is_final=False
                        )
                    # Get final message with token counts
                    final_message = stream.get_final_message()
                    total_in_tokens = final_message.usage.input_tokens
                    total_out_tokens = final_message.usage.output_tokens
                # Send final chunk with token counts
                yield LlmChunk(
                    text="",
                    in_token=total_in_tokens,
                    out_token=total_out_tokens,
                    model=model_name,
                    is_final=True
                )
                logger.info(f"Input Tokens: {total_in_tokens}")
                logger.info(f"Output Tokens: {total_out_tokens}")
            else:
                # Gemini streaming
                logger.debug(f"Streaming request to Gemini model '{model_name}'...")
                full_prompt = system + "\n\n" + prompt
                llm, generation_config = self._get_gemini_model(model_name, effective_temperature)
                response = llm.generate_content(
                    full_prompt,
                    generation_config=generation_config,
                    safety_settings=self.safety_settings,
                    stream=True  # Enable streaming
                )
                total_in_tokens = 0
                total_out_tokens = 0
                # Stream chunks
                for chunk in response:
                    if chunk.text:
                        yield LlmChunk(
                            text=chunk.text,
                            in_token=None,
                            out_token=None,
                            model=model_name,
                            is_final=False
                        )
                    # Accumulate token counts if available
                    if hasattr(chunk, 'usage_metadata') and chunk.usage_metadata:
                        if hasattr(chunk.usage_metadata, 'prompt_token_count'):
                            total_in_tokens = chunk.usage_metadata.prompt_token_count
                        if hasattr(chunk.usage_metadata, 'candidates_token_count'):
                            total_out_tokens = chunk.usage_metadata.candidates_token_count
                # Send final chunk with token counts
                yield LlmChunk(
                    text="",
                    in_token=total_in_tokens,
                    out_token=total_out_tokens,
                    model=model_name,
                    is_final=True
                )
                logger.info(f"Input Tokens: {total_in_tokens}")
                logger.info(f"Output Tokens: {total_out_tokens}")
        except (google.api_core.exceptions.ResourceExhausted, RateLimitError) as e:
            logger.warning(f"Hit rate limit during streaming: {e}")
            raise TooManyRequests()
        except Exception as e:
            logger.error(f"VertexAI streaming exception: {e}", exc_info=True)
            raise e
    @staticmethod
    def add_args(parser):
--- a/trustgraph/pyproject.toml
+++ b/trustgraph/pyproject.toml
@ -10,12 +10,12 @@ description = "TrustGraph provides a means to run a pipeline of flexible AI proc
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "trustgraph-base>=1.5,<1.6",
+    "trustgraph-base>=1.6,<1.7",
-    "trustgraph-bedrock>=1.5,<1.6",
+    "trustgraph-bedrock>=1.6,<1.7",
-    "trustgraph-cli>=1.5,<1.6",
+    "trustgraph-cli>=1.6,<1.7",
-    "trustgraph-embeddings-hf>=1.5,<1.6",
+    "trustgraph-embeddings-hf>=1.6,<1.7",
-    "trustgraph-flow>=1.5,<1.6",
+    "trustgraph-flow>=1.6,<1.7",
-    "trustgraph-vertexai>=1.5,<1.6",
+    "trustgraph-vertexai>=1.6,<1.7",
 ]
 classifiers = [
    "Programming Language :: Python :: 3",