master -> 1.5 (README updates) (#552)

This commit is contained in:
cybermaggedon 2025-10-11 11:46:03 +01:00 committed by GitHub
parent ad35656811
commit 51107008fd
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
18 changed files with 891 additions and 120 deletions

README.md

@ -1,30 +1,33 @@
<div align="center">
## The Agentic AI Platform for Enterprise Availability, Scalability, and Security
<img src="product-platform-diagram.svg" width=100% />
---
[![PyPI version](https://img.shields.io/pypi/v/trustgraph.svg)](https://pypi.org/project/trustgraph/) ![E2E Tests](https://github.com/trustgraph-ai/trustgraph/actions/workflows/release.yaml/badge.svg)
[![Discord](https://img.shields.io/discord/1251652173201149994
)](https://discord.gg/sQMwkRz5GX) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/trustgraph-ai/trustgraph)
[**Docs**](https://docs.trustgraph.ai) | [**YouTube**](https://www.youtube.com/@TrustGraphAI?sub_confirmation=1) | [**Configuration Builder**](https://config-ui.demo.trustgraph.ai/) | [**Discord**](https://discord.gg/sQMwkRz5GX) | [**Blog**](https://blog.trustgraph.ai/subscribe)
</div>
---
**TrustGraph** is an agentic AI platform built to meet the enterprise demands for availability, scalability, and security. TrustGraph meets these demands by combining the enterprise-grade data streaming platform [Apache Pulsar](https://github.com/apache/pulsar/) with knowledge graphs, structured data storage, VectorDBs, and MCP interoperability all in a single containerized platform.
<details>
<summary>Table of Contents</summary>
<br>
- [**Key Features**](#key-features)<br>
- [**Why TrustGraph?**](#why-trustgraph)<br>
- [**Agentic MCP Demo**](#agentic-mcp-demo)<br>
- [**Getting Started**](#getting-started)<br>
- [**Configuration Builder**](#configuration-builder)<br>
- [**Architecture**](#architecture)<br>
- [**Context Engineering**](#context-engineering)<br>
- [**Knowledge Cores**](#knowledge-cores)<br>
- [**Platform Architecture**](#platform-architecture)<br>
- [**Integrations**](#integrations)<br>
- [**Observability & Telemetry**](#observability--telemetry)<br>
- [**Contributing**](#contributing)<br>
@ -33,40 +36,37 @@ Take control of your data and AI future with **TrustGraph**. Universal connector
</details>
---
## Key Features
To meet the demands of enterprises, a platform needs to enable multi-tenancy, user and agentic access controls, data management, and total data privacy. TrustGraph enables these capabilities with:
- **Flows and Flow Classes -> Multi-tenancy**. *Flow classes are sets of processing components that can be combined into logically separate flows for both users and agents.*
- **Collections -> User/agent access controls and data management**. *Collections enable grouping data with custom labels that can be used for limiting data access to both users and agents. Collections can be added, deleted, and listed.*
- **Tool Groups -> Multi-agent**. *Create groups for agent tools for multi-agent flows within a single deployment.*
- **Knowledge Cores -> Data management and data privacy**. *Knowledge cores are modular and reusable components of knowledge graphs and vector embeddings that can serve as "long-term memory".*
- **Fully Containerized Platform with Private Model Serving -> Total data privacy**. *The entire TrustGraph platform can be deployed in any environment while managing the deployment of private LLMs for total data sovereignty.*
- **No-LLM Knowledge Graph Retrieval -> Deterministic Natural Language Graph Retrieval**. *TrustGraph does **not** use LLMs for knowledge graph retrieval. Natural language queries use semantic similarity search as the basis for building graph queries without LLMs, enabling true graph-enhanced agentic flows.*
## Why TrustGraph?
If you want to build powerful, intelligent AI applications without getting bogged down by complex infrastructure, brittle data pipelines, or opaque "black box" systems, TrustGraph is the platform that accelerates your AI transformation by solving these core problems.
[![Why TrustGraph?](https://img.youtube.com/vi/Norboj8YP2M/maxresdefault.jpg)](https://www.youtube.com/watch?v=Norboj8YP2M)
- **Go Beyond Basic RAG with GraphRAG**: Stop building agents that just retrieve text snippets. TrustGraph provides the tooling to automatically build and query Knowledge Graphs combined with Vector Embeddings, enabling you to create applications with deep contextual reasoning and higher accuracy.
- **Decouple Your App from the AI Stack**: Our modular, containerized architecture lets you deploy anywhere (Docker, K8s, bare-metal) and swap out components (LLMs, vector DBs, graph DBs) without re-architecting your core application. Write your app once, knowing the underlying AI stack can evolve.
- **Automate the Knowledge Pipeline**: Focus on building your application's logic, not on writing ETL scripts for AI. TrustGraph provides a unified platform to ingest data from silos, transform it into structured Knowledge Packages, and deliver it to your AI, streamlining the entire "knowledge supply chain."
- **Enjoy Full Transparency & Control**: As an open-source platform, you get complete visibility into the system's inner workings. Debug more effectively, customize components to your needs, and maintain total control over your application's data flow and security, eliminating vendor lock-in.
## Agentic MCP Demo
[![Agentic MCP Demo](https://img.youtube.com/vi/mUCL1b1lmbA/maxresdefault.jpg)](https://www.youtube.com/watch?v=mUCL1b1lmbA)
## Getting Started
This is the quickest way to get started. See [other installation options](docs/README.md#ways-to-deploy).
- [**Quickstart Guide**](https://docs.trustgraph.ai/getting-started/)
- [**Configuration Builder**](#configuration-builder)
- [**Workbench**](#workbench)
- [**Developer APIs and CLI**](https://docs.trustgraph.ai/reference/)
- [**Example Notebooks**](https://github.com/trustgraph-ai/example-notebooks)
- [**Deployment Guide**](https://docs.trustgraph.ai/deployment/)
### Watch TrustGraph 101
### Developer APIs and CLI
- [**REST API**](docs/apis/README.md#rest-apis)
- [**Websocket API**](docs/apis/README.md#websocket-api)
- [**Python SDK**](https://trustgraph.ai/docs/api/apistarted)
- [**TrustGraph CLI**](https://trustgraph.ai/docs/running/cli)
### Install the TrustGraph CLI
```
pip3 install trustgraph-cli==<trustgraph-version>
```
> [!CAUTION]
> The `trustgraph-cli` version *must* match the selected **TrustGraph** release version.
[![TrustGraph 101](https://img.youtube.com/vi/rWYl_yhKCng/maxresdefault.jpg)](https://www.youtube.com/watch?v=rWYl_yhKCng)
## Configuration Builder
@ -74,55 +74,51 @@ The [**Configuration Builder**](https://config-ui.demo.trustgraph.ai/) assembles
- **Version**: Select the version of TrustGraph you'd like to deploy
- **Component Selection**: Choose from the available deployment platforms, LLMs, graph store, VectorDB, chunking algorithm, chunking parameters, and LLM parameters
- **Customization**: Customize the prompts for the LLM System, Data Extraction Agents, and Agent Flow
- **Customization**: Enable OCR pipelines and custom embeddings models
- **Finish Deployment**: Download the launch `YAML` files with deployment instructions
## Workbench
The **Workbench** is a UI that provides tools for interacting with all major features of the platform. The **Workbench** is enabled by default in the **Configuration Builder** and is available at port `8888` on deployment. The **Workbench** has the following capabilities:
- **Agentic, GraphRAG and LLM Chat**: Chat interface for agentic flows, GraphRAG queries, or directly interfacing with an LLM
- **Semantic Discovery**: Analyze semantic relationships with vector search, knowledge graph relationships, and 3D graph visualization
- **Data Management**: Load data into the **Librarian** for processing, create and upload **Knowledge Packages**
- **Flow Management**: Create and delete processing flow patterns
- **Prompt Management**: Edit all LLM prompts used in the platform during runtime
- **Agent Tools**: Define tools used by the Agent Flow including MCP tools
- **MCP Tools**: Connect to MCP servers
## Context Engineering
TrustGraph features a complete context engineering solution combining the power of Knowledge Graphs and VectorDBs. Connect your data to automatically construct Knowledge Graphs with mapped Vector Embeddings to deliver richer and more accurate context to LLMs for trustworthy agents.
**How TrustGraph's GraphRAG works:**
- **Automated Knowledge Graph Construction:** Data Transformation Agents process source data to automatically **extract key entities, topics, and the relationships** connecting them. Vector embeddings are then mapped to these semantic relationships for context retrieval.
- **Deterministic Graph Retrieval:** Semantic relationships are retrieved from the knowledge graph *without* the use of LLMs. When an agent needs to perform deep research, it first performs a **cosine similarity search** on the vector embeddings to identify potentially relevant concepts and relationships within the knowledge graph. This initial vector search **pinpoints relevant entry points** in the structured Knowledge Graph, which are built, *without* LLMs, into graph queries that retrieve the relevant subgraphs.
- **Context Generation via Subgraph Traversal:** Based on the ranked results from the similarity search, agents are provided with only the relevant subgraphs for **deep context**. Users can configure the **number of 'hops'** (relationship traversals) to extend the depth of knowledge available to the agents. This structured **subgraph**, containing entities and their relationships, forms a highly relevant, context-aware input prompt for the LLM, configurable by the number of entities, relationships, and overall subgraph size.
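The retrieval steps above can be sketched in a few lines of Python. This is a toy illustration with made-up triples and tiny vectors, not the platform's actual retrieval code:

```python
import math

# Toy knowledge graph: (subject, predicate, object) triples.
TRIPLES = [
    ("TrustGraph", "is_a", "AI platform"),
    ("TrustGraph", "uses", "Apache Pulsar"),
    ("Apache Pulsar", "is_a", "streaming platform"),
]

# Hypothetical entity embeddings (tiny 3-d vectors for illustration).
EMBEDDINGS = {
    "TrustGraph": [1.0, 0.0, 0.0],
    "Apache Pulsar": [0.0, 1.0, 0.0],
    "AI platform": [0.9, 0.1, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def entry_points(query_vec, k=2):
    """Step 1: cosine similarity search ranks graph entry points."""
    ranked = sorted(EMBEDDINGS, key=lambda e: cosine(query_vec, EMBEDDINGS[e]), reverse=True)
    return ranked[:k]

def subgraph(entities, hops=1):
    """Step 2: expand the entry points by a configurable number of hops, no LLM involved."""
    nodes, edges = set(entities), set()
    for _ in range(hops):
        frontier = set(nodes)
        new_nodes = set()
        for s, p, o in TRIPLES:
            if s in frontier or o in frontier:
                edges.add((s, p, o))
                new_nodes.update([s, o])
        nodes |= new_nodes
    return edges

seeds = entry_points([1.0, 0.05, 0.0])
context = subgraph(seeds, hops=1)
```

The `hops` parameter corresponds to the configurable relationship-traversal depth described above.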
## Knowledge Cores
One of the biggest challenges currently facing RAG architectures is the ability to quickly reuse and integrate knowledge sets like long-term memory for LLMs. **TrustGraph** solves this problem by storing the results of the data ingestion process in reusable Knowledge Cores. Being able to store and reuse the Knowledge Cores means the data transformation process has to be run only once. These reusable Knowledge Cores can be loaded back into **TrustGraph** and used for GraphRAG. Some sample knowledge cores are available for download [here](https://github.com/trustgraph-ai/catalog/tree/master/v3).
A Knowledge Core has two components:
- Set of Graph Edges
- Set of mapped Vector Embeddings
When a Knowledge Core is loaded into TrustGraph, the corresponding graph edges and vector embeddings are queued and loaded into the chosen graph and vector stores.
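As a rough illustration, a knowledge core can be thought of as a pair of collections that are queued into the graph and vector stores on load. The field names below are hypothetical, not the actual serialization format:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeCore:
    """Sketch of a knowledge core: graph edges plus mapped vector embeddings."""
    triples: list = field(default_factory=list)     # (subject, predicate, object) edges
    embeddings: dict = field(default_factory=dict)  # entity -> vector

core = KnowledgeCore(
    triples=[("challenger", "launch_date", "1986-01-28")],
    embeddings={"challenger": [0.12, -0.55, 0.33]},
)

# On load, edges go to the chosen graph store and vectors to the VectorDB
# (plain containers stand in for the stores here).
graph_store, vector_store = [], {}
graph_store.extend(core.triples)
vector_store.update(core.embeddings)
```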
## Architecture
The platform contains the services, stores, control plane, and API gateway needed to connect your data to intelligent agents.
![architecture](TG-platform-diagram.svg)
## Platform Architecture
The platform orchestrates a comprehensive suite of services to transform external data into intelligent, actionable outputs for AI agents and users. It interacts with external data sources and external services (like LLM APIs) via an **API Gateway**.
Within the **TrustGraph** Platform, the services are grouped as follows:
- **Data Orchestration:** This crucial set of services manages the entire lifecycle of ingesting and preparing data to become AI-ready knowledge. It includes **Data Ingest** capabilities for various data types, a *Data Librarian* for managing and cataloging this information, *Data Transformation* services to clean, structure, and refine raw data, and ultimately produces consumable *Knowledge Cores*, the structured, enriched knowledge artifacts for AI.
- **Data Storage:** The platform relies on a flexible storage layer designed to handle the diverse needs of AI applications. This includes dedicated storage for *Knowledge Graphs* (to represent interconnected relationships), *VectorDBs* (for efficient semantic similarity search on embeddings), and *Tabular Datastores* (for structured data).
- **Context Orchestration:** This is the core reasoning engine of the platform. It leverages the structured knowledge from the Storage layer to perform *Deep Knowledge Retrieval* (advanced search and context discovery beyond simple keyword matching) and facilitate *Agentic Thinking*, enabling AI agents to process information and form complex responses or action plans.
- **Agent Orchestration:** This group of services is dedicated to managing and empowering the AI agents themselves. The *Agent Manager* handles the lifecycle, configuration, and operation of agents, while *Agent Tools* provide a framework or library of capabilities that agents can utilize to perform actions or interact with other systems.
- **Private Model Serving:** This layer is responsible for the deployment, management, and operationalization of the various AI models TrustGraph uses or provides to agents. This includes *LLM Deployment*, *Embeddings Deployment*, and *OCR Deployment*. Crucially, it features *Cross Hardware Support*, indicating the platform's ability to run these models across diverse computing environments.
- **Prompt Management:** Effective interaction with AI, especially LLMs and agents, requires precise instruction. This service centralizes the management of all prompt types: *LLM System Prompts* (to define an LLM's persona or core instructions), *Data Transformation Prompts* (to guide AI in structuring data), **RAG Context** generation (providing relevant intelligence to LLMs), and *Agent Definitions* (the core instructions and goals for AI agents).
- **Platform Services:** These foundational services provide the essential operational backbone for the entire TrustGraph platform, ensuring it runs securely, reliably, and efficiently. This includes *Access Controls* (for security and permissions), *Secrets Management* (for handling sensitive credentials), *Logging* (for audit and diagnostics), *Observability* (for monitoring platform health and performance), *Realtime Cost Observability* (for tracking resource consumption expenses), and *Hardware Resource Management* (for optimizing the use of underlying compute).
@ -197,44 +193,11 @@ TrustGraph provides maximum flexibility so your agents are always powered by the
- Azure<br>
- Google Cloud<br>
- Intel Tiber Cloud<br>
- OVHcloud<br>
- Scaleway<br>
</details>
### Pulsar Control Plane
- For flows, Pulsar accepts the output of a processing module and queues it for input to the next subscribed module.
- For services such as LLMs and embeddings, Pulsar provides a client/server model. A Pulsar queue is used as the input to the service. When processed, the output is then delivered to a separate queue where a client subscriber can request that output.
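The client/server pattern can be illustrated with standard-library queues standing in for Pulsar topics. This is a single-threaded sketch; a real deployment uses Pulsar producers and consumers against a running broker:

```python
import queue

# Two queues stand in for Pulsar topics: one for requests, one for responses.
request_topic = queue.Queue()
response_topic = queue.Queue()

def llm_service():
    """Server side: consume a request from the input queue,
    publish the result to the response queue."""
    prompt = request_topic.get()
    response_topic.put(f"completion for: {prompt}")

# Client side: publish a request, then subscribe for the output.
request_topic.put("Summarize the document")
llm_service()
result = response_topic.get()
```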
PDF file:
```
tg-load-pdf <document.pdf>
```
Text or Markdown file:
```
tg-load-text <document.txt>
```
### GraphRAG Queries
Once the knowledge graph and embeddings have been built or a knowledge core has been loaded, RAG queries are launched with a single line:
```
tg-invoke-graph-rag -q "What are the top 3 takeaways from the document?"
```
### Agent Flow
Invoking the Agent Flow uses a ReAct-style approach that combines GraphRAG and text completion requests to think through a problem solution.
```
tg-invoke-agent -v -q "Write a blog post on the top 3 takeaways from the document."
```
> [!TIP]
> Adding `-v` to the agent request will return all of the agent manager's thoughts and observations that led to the final response.
## Observability & Telemetry
Once the platform is running, access the Grafana dashboard at:
@ -252,22 +215,28 @@ password: admin
The default Grafana dashboard tracks the following:
<details>
<summary>Telemetry</summary>
<br>
- LLM Latency<br>
- Error Rate<br>
- Service Request Rates<br>
- Queue Backlogs<br>
- Chunking Histogram<br>
- Error Source by Service<br>
- Rate Limit Events<br>
- CPU usage by Service<br>
- Memory usage by Service<br>
- Models Deployed<br>
- Token Throughput (Tokens/second)<br>
- Cost Throughput (Cost/second)<br>
</details>
## Contributing
[Developer's Guide](https://docs.trustgraph.ai/community/developer.html)
## License
@ -290,4 +259,4 @@ The default Grafana dashboard tracks the following:
## Support & Community
- Bug Reports & Feature Requests: [Discord](https://discord.gg/sQMwkRz5GX)
- Discussions & Questions: [Discord](https://discord.gg/sQMwkRz5GX)
- Documentation: [Docs](https://docs.trustgraph.ai/)


@ -10,6 +10,9 @@ The CLI tools are installed as part of the `trustgraph-cli` package:
pip install trustgraph-cli
```
> [!NOTE]
> The CLI version should match the version of TrustGraph being deployed.
## Global Options
Most CLI commands support these common options:


@ -0,0 +1,106 @@
# Knowledge Graph Architecture Foundations
## Foundation 1: Subject-Predicate-Object (SPO) Graph Model
**Decision**: Adopt SPO/RDF as the core knowledge representation model
**Rationale**:
- Provides maximum flexibility and interoperability with existing graph technologies
- Enables seamless translation to other graph query languages (e.g., SPO → Cypher, but not vice versa)
- Creates a foundation that "unlocks a lot" of downstream capabilities
- Supports both node-to-node relationships (SPO) and node-to-literal relationships (RDF)
**Implementation**:
- Core data structure: `node → edge → {node | literal}`
- Maintain compatibility with RDF standards while supporting extended SPO operations
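A minimal sketch of the `node → edge → {node | literal}` structure follows; these are illustrative classes, not the actual `trustgraph.schema` types:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Node:
    """A graph node identified by an IRI."""
    iri: str

@dataclass(frozen=True)
class Triple:
    """Subject-Predicate-Object edge. The object is either another node
    (SPO relationship) or a literal value (RDF node-to-literal)."""
    subject: Node
    predicate: Node
    object: Union[Node, str]

# Node-to-node relationship:
t1 = Triple(Node("ex:TrustGraph"), Node("ex:uses"), Node("ex:ApachePulsar"))
# Node-to-literal relationship:
t2 = Triple(Node("ex:TrustGraph"), Node("ex:version"), "1.5")
```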
## Foundation 2: LLM-Native Knowledge Graph Integration
**Decision**: Optimize knowledge graph structure and operations for LLM interaction
**Rationale**:
- Primary use case involves LLMs interfacing with knowledge graphs
- Graph technology choices must prioritize LLM compatibility over other considerations
- Enables natural language processing workflows that leverage structured knowledge
**Implementation**:
- Design graph schemas that LLMs can effectively reason about
- Optimize for common LLM interaction patterns
## Foundation 3: Embedding-Based Graph Navigation
**Decision**: Implement direct mapping from natural language queries to graph nodes via embeddings
**Rationale**:
- Enables the simplest possible path from NLP query to graph navigation
- Avoids complex intermediate query generation steps
- Provides efficient semantic search capabilities within the graph structure
**Implementation**:
- `NLP Query → Graph Embeddings → Graph Nodes`
- Maintain embedding representations for all graph entities
- Support direct semantic similarity matching for query resolution
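The `NLP Query → Graph Embeddings → Graph Nodes` path might look like the sketch below, where the embedding function, vectors, and node names are all stand-ins:

```python
import math

# Hypothetical embeddings maintained for all graph entities.
node_embeddings = {
    "ex:Challenger": [0.9, 0.1],
    "ex:Apollo11": [0.1, 0.9],
}

def embed(text):
    """Stand-in embedding function; a real system calls an embeddings model."""
    return [1.0, 0.0] if "challenger" in text.lower() else [0.0, 1.0]

def resolve(query, k=1):
    """NLP query -> embedding -> nearest graph nodes by cosine similarity."""
    q = embed(query)
    def cos(v):
        dot = sum(a * b for a, b in zip(q, v))
        return dot / (math.hypot(*q) * math.hypot(*v))
    return sorted(node_embeddings, key=lambda n: cos(node_embeddings[n]), reverse=True)[:k]

nodes = resolve("When did Challenger launch?")
```

No intermediate query language is generated; the query goes straight from text to candidate graph nodes.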
## Foundation 4: Distributed Entity Resolution with Deterministic Identifiers
**Decision**: Support parallel knowledge extraction with deterministic entity identification (80% rule)
**Rationale**:
- **Ideal**: Single-process extraction with complete state visibility enables perfect entity resolution
- **Reality**: Scalability requirements demand parallel processing capabilities
- **Compromise**: Design for deterministic entity identification across distributed processes
**Implementation**:
- Develop mechanisms for generating consistent, unique identifiers across different knowledge extractors
- Same entity mentioned in different processes must resolve to the same identifier
- Acknowledge that ~20% of edge cases may require alternative processing models
- Design fallback mechanisms for complex entity resolution scenarios
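One possible approach to deterministic identifiers is hashing a normalized label, sketched below. This covers only the happy path: aliases ("IBM" vs. "International Business Machines") are among the edge cases that still need fallback resolution:

```python
import hashlib

def entity_id(label: str) -> str:
    """Derive a deterministic identifier from a normalized entity label,
    so independent extractors agree on the same ID without shared state."""
    normalized = " ".join(label.lower().split())  # case-fold, collapse whitespace
    return "ent:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# Two extractors, different surface forms, same resolved entity:
a = entity_id("Apache  Pulsar")
b = entity_id("apache pulsar")
```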
## Foundation 5: Event-Driven Architecture with Publish-Subscribe
**Decision**: Implement pub-sub messaging system for system coordination
**Rationale**:
- Enables loose coupling between knowledge extraction, storage, and query components
- Supports real-time updates and notifications across the system
- Facilitates scalable, distributed processing workflows
**Implementation**:
- Message-driven coordination between system components
- Event streams for knowledge updates, extraction completion, and query results
## Foundation 6: Reentrant Agent Communication
**Decision**: Support reentrant pub-sub operations for agent-based processing
**Rationale**:
- Enables sophisticated agent workflows where agents can trigger and respond to each other
- Supports complex, multi-step knowledge processing pipelines
- Allows for recursive and iterative processing patterns
**Implementation**:
- Pub-sub system must handle reentrant calls safely
- Agent coordination mechanisms that prevent infinite loops
- Support for agent workflow orchestration
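A toy illustration of reentrant publishing with a depth cap to prevent infinite loops; the handler names are hypothetical, and the real platform coordinates agents over Pulsar rather than in-process callbacks:

```python
from collections import defaultdict

subscribers = defaultdict(list)
MAX_DEPTH = 5  # guard against unbounded reentrant publishing

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, message, depth=0):
    """Deliver a message; handlers may publish again (reentrancy), up to a depth cap."""
    if depth >= MAX_DEPTH:
        return
    for handler in subscribers[topic]:
        handler(message, depth)

log = []

def agent_a(msg, depth):
    log.append(("a", msg))
    publish("topic-b", msg + 1, depth + 1)  # reentrant publish from inside a handler

def agent_b(msg, depth):
    log.append(("b", msg))

subscribe("topic-a", agent_a)
subscribe("topic-b", agent_b)
publish("topic-a", 0)
```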
## Foundation 7: Columnar Data Store Integration
**Decision**: Ensure query compatibility with columnar storage systems
**Rationale**:
- Enables efficient analytical queries over large knowledge datasets
- Supports business intelligence and reporting use cases
- Bridges graph-based knowledge representation with traditional analytical workflows
**Implementation**:
- Query translation layer: Graph queries → Columnar queries
- Hybrid storage strategy supporting both graph operations and analytical workloads
- Maintain query performance across both paradigms
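The translation layer's job can be illustrated by pivoting triples into columns, a sketch with made-up data:

```python
# SPO triples about two documents.
triples = [
    ("doc1", "title", "Intro"),
    ("doc1", "pages", 10),
    ("doc2", "title", "Advanced"),
    ("doc2", "pages", 25),
]

# Group by subject, then emit one column per predicate (columnar layout).
rows = {}
for s, p, o in triples:
    rows.setdefault(s, {})[p] = o

subjects = sorted(rows)
table = {
    "subject": subjects,
    "title": [rows[s].get("title") for s in subjects],
    "pages": [rows[s].get("pages") for s in subjects],
}
```

Analytical queries ("average page count") then run over the `pages` column directly, while graph operations keep using the triples.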
---
## Architecture Principles Summary
1. **Flexibility First**: SPO/RDF model provides maximum adaptability
2. **LLM Optimization**: All design decisions consider LLM interaction requirements
3. **Semantic Efficiency**: Direct embedding-to-node mapping for optimal query performance
4. **Pragmatic Scalability**: Balance perfect accuracy with practical distributed processing
5. **Event-Driven Coordination**: Pub-sub enables loose coupling and scalability
6. **Agent-Friendly**: Support complex, multi-agent processing workflows
7. **Analytical Compatibility**: Bridge graph and columnar paradigms for comprehensive querying
These foundations establish a knowledge graph architecture that balances theoretical rigor with practical scalability requirements, optimized for LLM integration and distributed processing.


@ -0,0 +1,169 @@
# TrustGraph Logging Strategy
## Overview
TrustGraph uses Python's built-in `logging` module for all logging operations. This provides a standardized, flexible approach to logging across all components of the system.
## Default Configuration
### Logging Level
- **Default Level**: `INFO`
- **Debug Mode**: `DEBUG` (enabled via command-line argument)
- **Production**: `WARNING` or `ERROR` as appropriate
### Output Destination
All logs should be written to **standard output (stdout)** to ensure compatibility with containerized environments and log aggregation systems.
## Implementation Guidelines
### 1. Logger Initialization
Each module should create its own logger using the module's `__name__`:
```python
import logging
logger = logging.getLogger(__name__)
```
### 2. Centralized Configuration
The logging configuration should be centralized in `async_processor.py` (or a dedicated logging configuration module) since it's inherited by much of the codebase:
```python
import logging
import argparse
def setup_logging(log_level='INFO'):
    """Configure logging for the entire application"""
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[logging.StreamHandler()]
    )

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--log-level',
        default='INFO',
        choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
        help='Set the logging level (default: INFO)'
    )
    return parser.parse_args()

# In main execution
if __name__ == '__main__':
    args = parse_args()
    setup_logging(args.log_level)
```
### 3. Logging Best Practices
#### Log Levels Usage
- **DEBUG**: Detailed information for diagnosing problems (variable values, function entry/exit)
- **INFO**: General informational messages (service started, configuration loaded, processing milestones)
- **WARNING**: Warning messages for potentially harmful situations (deprecated features, recoverable errors)
- **ERROR**: Error messages for serious problems (failed operations, exceptions)
- **CRITICAL**: Critical messages for system failures requiring immediate attention
#### Message Format
```python
# Good - includes context
logger.info(f"Processing document: {doc_id}, size: {doc_size} bytes")
logger.error(f"Failed to connect to database: {error}", exc_info=True)
# Avoid - lacks context
logger.info("Processing document")
logger.error("Connection failed")
```
#### Performance Considerations
```python
# Use lazy formatting for expensive operations
logger.debug("Expensive operation result: %s", expensive_function())
# Check log level for very expensive debug operations
if logger.isEnabledFor(logging.DEBUG):
    debug_data = compute_expensive_debug_info()
    logger.debug(f"Debug data: {debug_data}")
```
### 4. Structured Logging
For complex data, use structured logging:
```python
logger.info("Request processed", extra={
    'request_id': request_id,
    'duration_ms': duration,
    'status_code': status_code,
    'user_id': user_id
})
```
### 5. Exception Logging
Always include stack traces for exceptions:
```python
try:
    process_data()
except Exception as e:
    logger.error(f"Failed to process data: {e}", exc_info=True)
    raise
```
### 6. Async Logging Considerations
For async code, ensure thread-safe logging:
```python
import asyncio
import logging
async def async_operation():
    logger = logging.getLogger(__name__)
    logger.info(f"Starting async operation in task: {asyncio.current_task().get_name()}")
```
## Environment Variables
Support environment-based configuration as a fallback:
```python
import os
log_level = os.environ.get('TRUSTGRAPH_LOG_LEVEL', 'INFO')
```
## Testing
During tests, consider using a different logging configuration:
```python
# In test setup
logging.getLogger().setLevel(logging.WARNING) # Reduce noise during tests
```
## Monitoring Integration
Ensure log format is compatible with monitoring tools:
- Include timestamps in ISO format
- Use consistent field names
- Include correlation IDs where applicable
- Structure logs for easy parsing (JSON format for production)
## Security Considerations
- Never log sensitive information (passwords, API keys, personal data)
- Sanitize user input before logging
- Use placeholders for sensitive fields: `user_id=****1234`
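A masking helper along the lines of the `user_id=****1234` placeholder might look like this sketch; it is not an existing TrustGraph utility:

```python
def mask(value: str, visible: int = 4) -> str:
    """Mask a sensitive field, keeping only the trailing characters."""
    return "****" + value[-visible:] if len(value) > visible else "****"

user_id = "user_98761234"
# Log the masked value, never the raw identifier:
# logger.info("Login for user_id=%s", mask(user_id))
```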
## Migration Path
For existing code using print statements:
1. Replace `print()` with appropriate logger calls
2. Choose appropriate log levels based on message importance
3. Add context to make logs more useful
4. Test logging output at different levels


@ -0,0 +1,91 @@
# Schema Directory Refactoring Proposal
## Current Issues
1. **Flat structure** - All schemas in one directory makes it hard to understand relationships
2. **Mixed concerns** - Core types, domain objects, and API contracts all mixed together
3. **Unclear naming** - Files like "object.py", "types.py", "topic.py" don't clearly indicate their purpose
4. **No clear layering** - Can't easily see what depends on what
## Proposed Structure
```
trustgraph-base/trustgraph/schema/
├── __init__.py
├── core/                  # Core primitive types used everywhere
│   ├── __init__.py
│   ├── primitives.py      # Error, Value, Triple, Field, RowSchema
│   ├── metadata.py        # Metadata record
│   └── topic.py           # Topic utilities
├── knowledge/             # Knowledge domain models and extraction
│   ├── __init__.py
│   ├── graph.py           # EntityContext, EntityEmbeddings, Triples
│   ├── document.py        # Document, TextDocument, Chunk
│   ├── knowledge.py       # Knowledge extraction types
│   ├── embeddings.py      # All embedding-related types (moved from multiple files)
│   └── nlp.py             # Definition, Topic, Relationship, Fact types
└── services/              # Service request/response contracts
    ├── __init__.py
    ├── llm.py             # TextCompletion, Embeddings, Tool requests/responses
    ├── retrieval.py       # GraphRAG, DocumentRAG queries/responses
    ├── query.py           # GraphEmbeddingsRequest/Response, DocumentEmbeddingsRequest/Response
    ├── agent.py           # Agent requests/responses
    ├── flow.py            # Flow requests/responses
    ├── prompt.py          # Prompt service requests/responses
    ├── config.py          # Configuration service
    ├── library.py         # Librarian service
    └── lookup.py          # Lookup service
```
## Key Changes
1. **Hierarchical organization** - Clear separation between core types, knowledge models, and service contracts
2. **Better naming**:
- `types.py``core/primitives.py` (clearer purpose)
- `object.py` → Split between appropriate files based on actual content
- `documents.py``knowledge/document.py` (singular, consistent)
- `models.py``services/llm.py` (clearer what kind of models)
- `prompt.py` → Split: service parts to `services/prompt.py`, data types to `knowledge/nlp.py`
3. **Logical grouping**:
- All embedding types consolidated in `knowledge/embeddings.py`
- All LLM-related service contracts in `services/llm.py`
- Clear separation of request/response pairs in services directory
- Knowledge extraction types grouped with other knowledge domain models
4. **Dependency clarity**:
- Core types have no dependencies
- Knowledge models depend only on core
- Service contracts can depend on both core and knowledge models
## Migration Benefits
1. **Easier navigation** - Developers can quickly find what they need
2. **Better modularity** - Clear boundaries between different concerns
3. **Simpler imports** - More intuitive import paths
4. **Future-proof** - Easy to add new knowledge types or services without cluttering
## Example Import Changes
```python
# Before
from trustgraph.schema import Error, Triple, GraphEmbeddings, TextCompletionRequest
# After
from trustgraph.schema.core import Error, Triple
from trustgraph.schema.knowledge import GraphEmbeddings
from trustgraph.schema.services import TextCompletionRequest
```
## Implementation Notes
1. Keep backward compatibility by maintaining imports in root `__init__.py`
2. Move files gradually, updating imports as needed
3. Consider adding a `legacy.py` that imports everything for transition period
4. Update documentation to reflect new structure


@ -0,0 +1,253 @@
# Structured Data Technical Specification
## Overview
This specification describes the integration of TrustGraph with structured data flows, enabling the system to work with data that can be represented as rows in tables or objects in object stores. The integration supports four primary use cases:
1. **Unstructured to Structured Extraction**: Read unstructured data sources, identify and extract object structures, and store them in a tabular format
2. **Structured Data Ingestion**: Load data that is already in structured formats directly into the structured store alongside extracted data
3. **Natural Language Querying**: Convert natural language questions into structured queries to extract matching data from the store
4. **Direct Structured Querying**: Execute structured queries directly against the data store for precise data retrieval
## Goals
- **Unified Data Access**: Provide a single interface for accessing both structured and unstructured data within TrustGraph
- **Seamless Integration**: Enable smooth interoperability between TrustGraph's graph-based knowledge representation and traditional structured data formats
- **Flexible Extraction**: Support automatic extraction of structured data from various unstructured sources (documents, text, etc.)
- **Query Versatility**: Allow users to query data using both natural language and structured query languages
- **Data Consistency**: Maintain data integrity and consistency across different data representations
- **Performance Optimization**: Ensure efficient storage and retrieval of structured data at scale
- **Schema Flexibility**: Support both schema-on-write and schema-on-read approaches to accommodate diverse data sources
- **Backwards Compatibility**: Preserve existing TrustGraph functionality while adding structured data capabilities
## Background
TrustGraph currently excels at processing unstructured data and building knowledge graphs from diverse sources. However, many enterprise use cases involve data that is inherently structured - customer records, transaction logs, inventory databases, and other tabular datasets. These structured datasets often need to be analyzed alongside unstructured content to provide comprehensive insights.
Current limitations include:
- No native support for ingesting pre-structured data formats (CSV, JSON arrays, database exports)
- Inability to preserve the inherent structure when extracting tabular data from documents
- Lack of efficient querying mechanisms for structured data patterns
- Missing bridge between SQL-like queries and TrustGraph's graph queries
This specification addresses these gaps by introducing a structured data layer that complements TrustGraph's existing capabilities. By supporting structured data natively, TrustGraph can:
- Serve as a unified platform for both structured and unstructured data analysis
- Enable hybrid queries that span both graph relationships and tabular data
- Provide familiar interfaces for users accustomed to working with structured data
- Unlock new use cases in data integration and business intelligence
## Technical Design
### Architecture
The structured data integration requires the following technical components:
1. **NLP-to-Structured-Query Service**
- Converts natural language questions into structured queries
- Supports multiple query language targets (initially SQL-like syntax)
- Integrates with existing TrustGraph NLP capabilities
Module: trustgraph-flow/trustgraph/query/nlp_query/cassandra
2. **Configuration Schema Support** **[COMPLETE]**
- Extended configuration system to store structured data schemas
- Support for defining table structures, field types, and relationships
- Schema versioning and migration capabilities
3. **Object Extraction Module** **[COMPLETE]**
- Enhanced knowledge extractor flow integration
- Identifies and extracts structured objects from unstructured sources
- Maintains provenance and confidence scores
- Registers a config handler (example: trustgraph-flow/trustgraph/prompt/template/service.py) to receive config data and decode schema information
- Receives objects and decodes them to ExtractedObject objects for delivery on the Pulsar queue
- NOTE: There's existing code at `trustgraph-flow/trustgraph/extract/object/row/`. This was a previous attempt and will need substantial refactoring, as it doesn't conform to current APIs. Use it if it's useful; start from scratch if not.
- Requires a command-line interface: `kg-extract-objects`
Module: trustgraph-flow/trustgraph/extract/kg/objects/
4. **Structured Store Writer Module** **[COMPLETE]**
- Receives objects in ExtractedObject format from Pulsar queues
- Initial implementation targeting Apache Cassandra as the structured data store
- Handles dynamic table creation based on schemas encountered
- Manages schema-to-Cassandra table mapping and data transformation
- Provides batch and streaming write operations for performance optimization
- No Pulsar outputs - this is a terminal service in the data flow
**Schema Handling**:
- Monitors incoming ExtractedObject messages for schema references
- When a new schema is encountered for the first time, automatically creates the corresponding Cassandra table
- Maintains a cache of known schemas to avoid redundant table creation attempts
- Should consider whether to receive schema definitions directly or rely on schema names in ExtractedObject messages
**Cassandra Table Mapping**:
- Keyspace is named after the `user` field from ExtractedObject's Metadata
- Table is named after the `schema_name` field from ExtractedObject
- Collection from Metadata becomes part of the partition key to ensure:
- Natural data distribution across Cassandra nodes
- Efficient queries within a specific collection
- Logical isolation between different data imports/sources
- Primary key structure: `PRIMARY KEY ((collection, <schema_primary_key_fields>), <clustering_keys>)`
- Collection is always the first component of the partition key
- Schema-defined primary key fields follow as part of the composite partition key
- This requires queries to specify the collection, ensuring predictable performance
- Field definitions map to Cassandra columns with type conversions:
- `string``text`
- `integer``int` or `bigint` based on size hint
- `float``float` or `double` based on precision needs
- `boolean``boolean`
- `timestamp``timestamp`
- `enum``text` with application-level validation
- Indexed fields create Cassandra secondary indexes (excluding fields already in the primary key)
- Required fields are enforced at the application level (Cassandra doesn't support NOT NULL)
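A sketch of the table-creation logic implied by the mapping above (the schema dict shape follows the example `customer_records` configuration; clustering keys are omitted for brevity, and `bigint`/`double` are assumed as the wide defaults when no size hint is present):

```python
# Assumed type mapping, widest variant chosen where the spec offers a choice
CQL_TYPES = {
    "string": "text",
    "integer": "bigint",
    "float": "double",
    "boolean": "boolean",
    "timestamp": "timestamp",
    "enum": "text",  # validated at the application level
}

def create_table_cql(keyspace, table, collection_col, schema):
    """Build a CREATE TABLE statement with a collection-first partition key."""
    cols = [f"{collection_col} text"]
    pk_fields = []
    for field in schema["fields"]:
        cols.append(f"{field['name']} {CQL_TYPES[field['type']]}")
        if field.get("primary_key"):
            pk_fields.append(field["name"])
    # Collection is always the first component of the partition key
    partition = ", ".join([collection_col] + pk_fields)
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} ("
        + ", ".join(cols)
        + f", PRIMARY KEY (({partition})))"
    )
```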
**Object Storage**:
- Extracts values from ExtractedObject.values map
- Performs type conversion and validation before insertion
- Handles missing optional fields gracefully
- Maintains metadata about object provenance (source document, confidence scores)
- Supports idempotent writes to handle message replay scenarios
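A sketch of the value extraction and validation step (the conversion rules and schema dict shape are assumptions based on the example configuration, not the final API):

```python
def convert_values(schema, values):
    """Validate and convert an ExtractedObject-style values map into a row.

    Missing required fields raise ValueError; missing optional fields
    come back as None so the write can proceed gracefully.
    """
    converters = {
        "string": str, "integer": int, "float": float,
        "boolean": lambda v: str(v).lower() in ("true", "1"),
        "timestamp": str, "enum": str,
    }
    row = {}
    for field in schema["fields"]:
        name = field["name"]
        if name not in values:
            if field.get("required") or field.get("primary_key"):
                raise ValueError(f"missing required field: {name}")
            row[name] = None
            continue
        value = converters[field["type"]](values[name])
        # Enum membership is enforced here, not in Cassandra
        allowed = field.get("enum")
        if allowed and value not in allowed:
            raise ValueError(f"{name}: {value!r} not in {allowed}")
        row[name] = value
    return row
```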
**Implementation Notes**:
- Existing code at `trustgraph-flow/trustgraph/storage/objects/cassandra/` is outdated and doesn't comply with current APIs
- Should reference `trustgraph-flow/trustgraph/storage/triples/cassandra` as an example of a working storage processor
- Needs evaluation of existing code for any reusable components before deciding to refactor or rewrite
Module: trustgraph-flow/trustgraph/storage/objects/cassandra
5. **Structured Query Service**
- Accepts structured queries in defined formats
- Executes queries against the structured store
- Returns objects matching query criteria
- Supports pagination and result filtering
Module: trustgraph-flow/trustgraph/query/objects/cassandra
6. **Agent Tool Integration**
- New tool class for agent frameworks
- Enables agents to query structured data stores
- Provides natural language and structured query interfaces
- Integrates with existing agent decision-making processes
7. **Structured Data Ingestion Service**
- Accepts structured data in multiple formats (JSON, CSV, XML)
- Parses and validates incoming data against defined schemas
- Converts data into normalized object streams
- Emits objects to appropriate message queues for processing
- Supports bulk uploads and streaming ingestion
Module: trustgraph-flow/trustgraph/decoding/structured
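A sketch of the CSV decoding path (the function name and options are illustrative; the real service would emit `ExtractedObject` messages to Pulsar rather than yield dicts):

```python
import csv
import io

def decode_csv(data, schema, options=None):
    """Decode CSV bytes into per-row value maps for downstream emission."""
    options = options or {}
    text = data.decode(options.get("encoding", "utf-8"))
    reader = csv.DictReader(io.StringIO(text))
    field_names = {f["name"] for f in schema["fields"]}
    for row in reader:
        # Drop columns the schema doesn't know about
        yield {k: v for k, v in row.items() if k in field_names}
```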
8. **Object Embedding Service**
- Generates vector embeddings for structured objects
- Enables semantic search across structured data
- Supports hybrid search combining structured queries with semantic similarity
- Integrates with existing vector stores
Module: trustgraph-flow/trustgraph/embeddings/object_embeddings/qdrant
### Data Models
#### Schema Storage Mechanism
Schemas are stored in TrustGraph's configuration system using the following structure:
- **Type**: `schema` (fixed value for all structured data schemas)
- **Key**: The unique name/identifier of the schema (e.g., `customer_records`, `transaction_log`)
- **Value**: JSON schema definition containing the structure
Example configuration entry:
```
Type: schema
Key: customer_records
Value: {
"name": "customer_records",
"description": "Customer information table",
"fields": [
{
"name": "customer_id",
"type": "string",
"primary_key": true
},
{
"name": "name",
"type": "string",
"required": true
},
{
"name": "email",
"type": "string",
"required": true
},
{
"name": "registration_date",
"type": "timestamp"
},
{
"name": "status",
"type": "string",
"enum": ["active", "inactive", "suspended"]
}
],
"indexes": ["email", "registration_date"]
}
```
This approach allows:
- Dynamic schema definition without code changes
- Easy schema updates and versioning
- Consistent integration with existing TrustGraph configuration management
- Support for multiple schemas within a single deployment
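A sketch of how a processor's config handler might decode these entries (the config mapping shape here, type name to key/value pairs, is an assumption mirroring the description above):

```python
import json

def on_config(config):
    """Decode `schema`-type config entries into schema dicts, keyed by name."""
    schemas = {}
    for key, value in config.get("schema", {}).items():
        schemas[key] = json.loads(value)
    return schemas
```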
### APIs
New APIs:
- Pulsar schemas for above types
- Pulsar interfaces in new flows
- A means to specify schema types in flows, so that each flow knows which schema types to load
- APIs added to gateway and rev-gateway
Modified APIs:
- Knowledge extraction endpoints - Add structured object output option
- Agent endpoints - Add structured data tool support
### Implementation Details
These follow existing conventions: they are simply new processing modules. Everything lives in the trustgraph-flow package, except for schema items, which go in trustgraph-base.
Some UI work is needed in the Workbench to demo and pilot this capability.
## Security Considerations
No extra considerations.
## Performance Considerations
There are open questions around Cassandra query patterns and secondary index usage to ensure queries remain performant as data volumes grow.
## Testing Strategy
Follow the existing test strategy: build unit, contract, and integration tests.
## Migration Plan
None.
## Timeline
Not specified.
## Open Questions
- Can this be made to work with other store types? The aim is to design interfaces so that modules written for one store can be adapted to others.
## References
n/a.


@ -0,0 +1,139 @@
# Structured Data Pulsar Schema Changes
## Overview
Based on the STRUCTURED_DATA.md specification, this document proposes the necessary Pulsar schema additions and modifications to support structured data capabilities in TrustGraph.
## Required Schema Changes
### 1. Core Schema Enhancements
#### Enhanced Field Definition
The existing `Field` class in `core/primitives.py` needs additional properties:
```python
class Field(Record):
name = String()
type = String() # int, string, long, bool, float, double, timestamp
size = Integer()
primary = Boolean()
description = String()
# NEW FIELDS:
required = Boolean() # Whether field is required
enum_values = Array(String()) # For enum type fields
indexed = Boolean() # Whether field should be indexed
```
### 2. New Knowledge Schemas
#### 2.1 Structured Data Submission
New file: `knowledge/structured.py`
```python
from pulsar.schema import Record, String, Bytes, Map
from ..core.metadata import Metadata
class StructuredDataSubmission(Record):
metadata = Metadata()
format = String() # "json", "csv", "xml"
schema_name = String() # Reference to schema in config
data = Bytes() # Raw data to ingest
options = Map(String()) # Format-specific options
```
### 3. New Service Schemas
#### 3.1 NLP to Structured Query Service
New file: `services/nlp_query.py`
```python
from pulsar.schema import Record, String, Array, Map, Integer, Double
from ..core.primitives import Error
class NLPToStructuredQueryRequest(Record):
natural_language_query = String()
max_results = Integer()
context_hints = Map(String()) # Optional context for query generation
class NLPToStructuredQueryResponse(Record):
error = Error()
graphql_query = String() # Generated GraphQL query
variables = Map(String()) # GraphQL variables if any
detected_schemas = Array(String()) # Which schemas the query targets
confidence = Double()
```
#### 3.2 Structured Query Service
New file: `services/structured_query.py`
```python
from pulsar.schema import Record, String, Map, Array
from ..core.primitives import Error
class StructuredQueryRequest(Record):
query = String() # GraphQL query
variables = Map(String()) # GraphQL variables
operation_name = String() # Optional operation name for multi-operation documents
class StructuredQueryResponse(Record):
error = Error()
data = String() # JSON-encoded GraphQL response data
errors = Array(String()) # GraphQL errors if any
```
#### 2.2 Object Extraction Output
New file: `knowledge/object.py`
```python
from pulsar.schema import Record, String, Map, Double
from ..core.metadata import Metadata
class ExtractedObject(Record):
metadata = Metadata()
schema_name = String() # Which schema this object belongs to
values = Map(String()) # Field name -> value
confidence = Double()
source_span = String() # Text span where object was found
```
### 4. Enhanced Knowledge Schemas
#### 4.1 Object Embeddings Enhancement
Update `knowledge/embeddings.py` to support structured object embeddings better:
```python
class StructuredObjectEmbedding(Record):
metadata = Metadata()
vectors = Array(Array(Double()))
schema_name = String()
object_id = String() # Primary key value
field_embeddings = Map(Array(Double())) # Per-field embeddings
```
## Integration Points
### Flow Integration
The schemas will be used by new flow modules:
- `trustgraph-flow/trustgraph/decoding/structured` - Uses StructuredDataSubmission
- `trustgraph-flow/trustgraph/query/nlp_query/cassandra` - Uses NLP query schemas
- `trustgraph-flow/trustgraph/query/objects/cassandra` - Uses structured query schemas
- `trustgraph-flow/trustgraph/extract/object/row/` - Consumes Chunk, produces ExtractedObject
- `trustgraph-flow/trustgraph/storage/objects/cassandra` - Uses Rows schema
- `trustgraph-flow/trustgraph/embeddings/object_embeddings/qdrant` - Uses object embedding schemas
## Implementation Notes
1. **Schema Versioning**: Consider adding a `version` field to RowSchema for future migration support
2. **Type System**: The `Field.type` should support all Cassandra native types
3. **Batch Operations**: Most services should support both single and batch operations
4. **Error Handling**: Consistent error reporting across all new services
5. **Backwards Compatibility**: Existing schemas remain unchanged except for minor Field enhancements
## Next Steps
1. Implement schema files in the new structure
2. Update existing services to recognize new schema types
3. Implement flow modules that use these schemas
4. Add gateway/rev-gateway endpoints for new services
5. Create unit tests for schema validation


@ -38,6 +38,28 @@ class Processor(DocumentEmbeddingsQueryService):
)
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
self.last_collection = None
def ensure_collection_exists(self, collection, dim):
    """Ensure collection exists, create if it doesn't"""
    if collection != self.last_collection:
        if not self.qdrant.collection_exists(collection):
            try:
                self.qdrant.create_collection(
                    collection_name=collection,
                    vectors_config=VectorParams(
                        size=dim, distance=Distance.COSINE
                    ),
                )
                logger.info(f"Created collection: {collection}")
            except Exception as e:
                logger.error(f"Qdrant collection creation failed: {e}")
                raise
        self.last_collection = collection

def collection_exists(self, collection):
    """Check if collection exists (no implicit creation)"""
    return self.qdrant.collection_exists(collection)


@ -38,6 +38,28 @@ class Processor(GraphEmbeddingsQueryService):
)
self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
self.last_collection = None
def ensure_collection_exists(self, collection, dim):
    """Ensure collection exists, create if it doesn't"""
    if collection != self.last_collection:
        if not self.qdrant.collection_exists(collection):
            try:
                self.qdrant.create_collection(
                    collection_name=collection,
                    vectors_config=VectorParams(
                        size=dim, distance=Distance.COSINE
                    ),
                )
                logger.info(f"Created collection: {collection}")
            except Exception as e:
                logger.error(f"Qdrant collection creation failed: {e}")
                raise
        self.last_collection = collection

def collection_exists(self, collection):
    """Check if collection exists (no implicit creation)"""
    return self.qdrant.collection_exists(collection)