release/v1.4 -> master (#548)

cybermaggedon 2025-10-06 17:54:26 +01:00 committed by GitHub
parent 3ec2cd54f9
commit 2bd68ed7f4
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
94 changed files with 8571 additions and 1740 deletions


The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clustering, leading to:
- Uneven load distribution across cluster nodes
- Scalability bottlenecks as collections grow
## Proposed Solution: 4-Table Denormalization Strategy
### Overview
Replace the single `triples` table with four purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and ALLOW FILTERING while providing optimal performance for all query types. The fourth table enables efficient collection deletion despite compound partition keys.
### New Schema Design
**Table 1: Subject-Centric Queries (triples_s)**
```sql
CREATE TABLE triples_s (
collection text,
s text,
p text,
o text,
PRIMARY KEY ((collection, s), p, o)
);
```
- **Optimizes:** get_s, get_sp
- **Partition Key:** (collection, s) - Better distribution than collection alone
- **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject
**Table 2: Predicate-Object Queries (triples_p)**
```sql
CREATE TABLE triples_p (
collection text,
p text,
o text,
s text,
PRIMARY KEY ((collection, p), o, s)
);
```
- **Optimizes:** get_p, get_po
- **Partition Key:** (collection, p) - Direct access by predicate
- **Clustering:** (o, s) - Efficient object-subject traversal
**Table 3: Object-Centric Queries (triples_o)**
```sql
CREATE TABLE triples_o (
collection text,
o text,
s text,
p text,
PRIMARY KEY ((collection, o), s, p)
);
```
- **Optimizes:** get_o, get_os
- **Partition Key:** (collection, o) - Direct access by object
- **Clustering:** (s, p) - Efficient subject-predicate traversal
**Table 4: Collection Management & SPO Queries (triples_collection)**
```sql
CREATE TABLE triples_collection (
collection text,
s text,
p text,
o text,
PRIMARY KEY (collection, s, p, o)
);
```
- **Optimizes:** get_spo, delete_collection
- **Partition Key:** collection only - Enables efficient collection-level operations
- **Clustering:** (s, p, o) - Standard triple ordering
- **Purpose:** Dual use for exact SPO lookups and as deletion index
### Query Mapping
| Original Query | Target Table | Performance Improvement |
|----------------|-------------|------------------------|
| get_all(collection) | triples_s | ALLOW FILTERING (acceptable for scan) |
| get_s(collection, s) | triples_s | Direct partition access |
| get_p(collection, p) | triples_p | Direct partition access |
| get_o(collection, o) | triples_o | Direct partition access |
| get_sp(collection, s, p) | triples_s | Partition + clustering |
| get_po(collection, p, o) | triples_p | **No more ALLOW FILTERING!** |
| get_os(collection, o, s) | triples_o | Partition + clustering |
| get_spo(collection, s, p, o) | triples_collection | Exact key lookup |
| delete_collection(collection) | triples_collection | Read index, batch delete all |
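The mapping above can be expressed as a small routing table. The sketch below is illustrative only (the `route` helper is hypothetical; the table names follow the 4-table schema in this document):

```python
# Hypothetical routing table: query pattern -> serving table.
# Table names follow the 4-table schema defined above.
QUERY_ROUTING = {
    "get_all": "triples_s",          # scan; ALLOW FILTERING accepted
    "get_s":   "triples_s",
    "get_sp":  "triples_s",
    "get_p":   "triples_p",
    "get_po":  "triples_p",
    "get_o":   "triples_o",
    "get_os":  "triples_o",
    "get_spo": "triples_collection",
}

def route(query: str) -> str:
    """Return the table that serves a query pattern with a direct access path."""
    try:
        return QUERY_ROUTING[query]
    except KeyError:
        raise ValueError(f"unknown query pattern: {query}")
```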
### Collection Deletion Strategy
With compound partition keys, we cannot simply execute `DELETE FROM table WHERE collection = ?`. Instead:
1. **Read Phase:** Query `triples_collection` to enumerate all triples:
```sql
SELECT s, p, o FROM triples_collection WHERE collection = ?
```
This is efficient since `collection` is the partition key for this table.
2. **Delete Phase:** For each triple (s, p, o), delete from all 4 tables using full partition keys:
```sql
DELETE FROM triples_s WHERE collection = ? AND s = ? AND p = ? AND o = ?
DELETE FROM triples_p WHERE collection = ? AND p = ? AND o = ? AND s = ?
DELETE FROM triples_o WHERE collection = ? AND o = ? AND s = ? AND p = ?
DELETE FROM triples_collection WHERE collection = ? AND s = ? AND p = ? AND o = ?
```
Batched in groups of 100 for efficiency.
**Trade-off Analysis:**
- ✅ Maintains optimal query performance with distributed partitions
- ✅ No hot partitions for large collections
- ❌ More complex deletion logic (read-then-delete)
- ❌ Deletion time proportional to collection size
### Benefits
1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path (except get_all scan)
2. **No Secondary Indexes** - Each table IS the index for its query pattern
3. **Better Data Distribution** - Composite partition keys spread load effectively
4. **Predictable Performance** - Query time proportional to result size, not total data
5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture
6. **Enables Collection Deletion** - triples_collection serves as deletion index
## Implementation Plan
### Implementation Strategy
#### Phase 1: Schema and Core Methods
1. **Rewrite `init()` method** - Create four tables instead of one
2. **Rewrite `insert()` method** - Batch writes to all four tables
3. **Implement prepared statements** - For optimal performance
4. **Add table routing logic** - Direct queries to optimal tables
5. **Implement collection deletion** - Read from triples_collection, batch delete from all tables
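A minimal sketch of step 1, the `init()` rewrite, assuming a cassandra-driver `Session` (the `SCHEMA` list and the standalone `init` signature are illustrative, not the project's actual code):

```python
# Sketch of the four-table init, assuming a cassandra-driver Session.
# The CQL mirrors the schemas defined above; IF NOT EXISTS keeps init idempotent.
SCHEMA = [
    """CREATE TABLE IF NOT EXISTS triples_s (
        collection text, s text, p text, o text,
        PRIMARY KEY ((collection, s), p, o))""",
    """CREATE TABLE IF NOT EXISTS triples_p (
        collection text, p text, o text, s text,
        PRIMARY KEY ((collection, p), o, s))""",
    """CREATE TABLE IF NOT EXISTS triples_o (
        collection text, o text, s text, p text,
        PRIMARY KEY ((collection, o), s, p))""",
    """CREATE TABLE IF NOT EXISTS triples_collection (
        collection text, s text, p text, o text,
        PRIMARY KEY (collection, s, p, o))""",
]

def init(session):
    # Create all four tables; safe to call repeatedly.
    for ddl in SCHEMA:
        session.execute(ddl)
```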
#### Phase 2: Query Method Optimization
1. **Rewrite each get_* method** to use optimal table
#### Insert Logic
```python
def insert(self, collection, s, p, o):
    batch = BatchStatement()

    # Insert into all four tables
    batch.add(self.insert_subject_stmt, (collection, s, p, o))
    batch.add(self.insert_po_stmt, (collection, p, o, s))
    batch.add(self.insert_object_stmt, (collection, o, s, p))
    batch.add(self.insert_collection_stmt, (collection, s, p, o))

    self.session.execute(batch)
```
#### Query Routing Logic
```python
def get_po(self, collection, p, o, limit=10):
    # Route to triples_p table - NO ALLOW FILTERING!
    return self.session.execute(
        self.get_po_stmt,
        (collection, p, o, limit)
    )

def get_spo(self, collection, s, p, o, limit=10):
    # Route to triples_collection table for exact SPO lookup
    return self.session.execute(
        self.get_spo_stmt,
        (collection, s, p, o, limit)
    )
```
#### Collection Deletion Logic
```python
def delete_collection(self, collection):
    # Step 1: Read all triples from the collection table
    rows = self.session.execute(
        f"SELECT s, p, o FROM {self.collection_table} WHERE collection = %s",
        (collection,)
    )

    # Step 2: Batch delete from all 4 tables
    # (SimpleStatement uses %s placeholders; '?' is only valid in prepared statements)
    batch = BatchStatement()
    count = 0

    for row in rows:
        s, p, o = row.s, row.p, row.o

        # Delete using full partition keys for each table
        batch.add(SimpleStatement(
            f"DELETE FROM {self.subject_table} WHERE collection = %s AND s = %s AND p = %s AND o = %s"
        ), (collection, s, p, o))
        batch.add(SimpleStatement(
            f"DELETE FROM {self.po_table} WHERE collection = %s AND p = %s AND o = %s AND s = %s"
        ), (collection, p, o, s))
        batch.add(SimpleStatement(
            f"DELETE FROM {self.object_table} WHERE collection = %s AND o = %s AND s = %s AND p = %s"
        ), (collection, o, s, p))
        batch.add(SimpleStatement(
            f"DELETE FROM {self.collection_table} WHERE collection = %s AND s = %s AND p = %s AND o = %s"
        ), (collection, s, p, o))

        count += 1

        # Execute every 100 triples to avoid oversized batches
        if count % 100 == 0:
            self.session.execute(batch)
            batch = BatchStatement()

    # Execute remaining deletions
    if count % 100 != 0:
        self.session.execute(batch)

    logger.info(f"Deleted {count} triples from collection {collection}")
```
#### Prepared Statement Optimization
```python
def prepare_statements(self):
    # Cache prepared statements for better performance
    self.insert_subject_stmt = self.session.prepare(
        f"INSERT INTO {self.subject_table} (collection, s, p, o) VALUES (?, ?, ?, ?)"
    )
    self.insert_po_stmt = self.session.prepare(
        f"INSERT INTO {self.po_table} (collection, p, o, s) VALUES (?, ?, ?, ?)"
    )
    self.insert_object_stmt = self.session.prepare(
        f"INSERT INTO {self.object_table} (collection, o, s, p) VALUES (?, ?, ?, ?)"
    )
    self.insert_collection_stmt = self.session.prepare(
        f"INSERT INTO {self.collection_table} (collection, s, p, o) VALUES (?, ?, ?, ?)"
    )
    # ... query statements
```
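The elided query statements might look as follows; this is an assumption, not confirmed by the source, but it is consistent with the routing code above, which references `get_po_stmt` and `get_spo_stmt`:

```python
# Possible shape of the query statements elided above ("... query statements").
# get_po_stmt and get_spo_stmt are the names used by the routing code earlier.
def prepare_query_statements(self):
    self.get_po_stmt = self.session.prepare(
        f"SELECT s FROM {self.po_table} "
        "WHERE collection = ? AND p = ? AND o = ? LIMIT ?"
    )
    self.get_spo_stmt = self.session.prepare(
        f"SELECT s, p, o FROM {self.collection_table} "
        "WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT ?"
    )
```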
## Migration Strategy
## Risks and Considerations
### Performance Risks
- **Write latency increase** - 4x write operations per insert (33% more than 3-table approach)
- **Storage overhead** - 4x storage requirement (33% more than 3-table approach)
- **Batch write failures** - Need proper error handling
- **Deletion complexity** - Collection deletion requires read-then-delete loop
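The batch-failure risk above can be mitigated with retries. A minimal sketch (the `execute_with_retry` helper is hypothetical, not part of the codebase):

```python
import time

# One way to handle batch write failures (a sketch, not the project's actual
# error handling): retry with exponential backoff before surfacing the error.
def execute_with_retry(execute, batch, retries=3, base_delay=0.1):
    for attempt in range(retries):
        try:
            return execute(batch)
        except Exception:
            if attempt == retries - 1:
                raise  # exhausted retries; let the caller handle it
            time.sleep(base_delay * (2 ** attempt))
```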
### Operational Risks
- **Migration complexity** - Data migration for large datasets

---
## Overview
This specification describes the collection management capabilities for TrustGraph, requiring explicit collection creation and providing direct control over the collection lifecycle. Collections must be explicitly created before use, ensuring proper synchronization between the librarian metadata and all storage backends. The feature supports four primary use cases:
1. **Collection Creation**: Explicitly create collections before storing data
2. **Collection Listing**: View all existing collections in the system
3. **Collection Metadata Management**: Update collection names, descriptions, and tags
4. **Collection Deletion**: Remove collections and their associated data across all storage types
## Goals
- **Explicit Collection Creation**: Require collections to be created before data can be stored
- **Storage Synchronization**: Ensure collections exist in all storage backends (vectors, objects, triples)
- **Collection Visibility**: Enable users to list and inspect all collections in their environment
- **Collection Cleanup**: Allow deletion of collections that are no longer needed
- **Collection Organization**: Support labels and tags for better collection tracking and discovery
- **Collection Discovery**: Make it easier to find specific collections through filtering and search
- **Operational Transparency**: Provide clear visibility into collection lifecycle and usage
- **Resource Management**: Enable cleanup of unused collections to optimize resource utilization
- **Data Integrity**: Prevent orphaned collections in storage without metadata tracking
## Background
Previously, collections in TrustGraph were implicitly created during data loading operations, leading to synchronization issues where collections could exist in storage backends without corresponding metadata in the librarian. This created management challenges and potential orphaned data.
The explicit collection creation model addresses these issues by:
- Requiring collections to be created before use via `tg-set-collection`
- Broadcasting collection creation to all storage backends
- Maintaining synchronized state between librarian metadata and storage
- Preventing writes to non-existent collections
- Providing clear collection lifecycle management
This specification defines the explicit collection management model. By requiring explicit collection creation, TrustGraph ensures:
- Collections are tracked in librarian metadata from creation
- All storage backends are aware of collections before receiving data
- No orphaned collections exist in storage
- Clear operational visibility and control over collection lifecycle
- Consistent error handling when operations reference non-existent collections
## Technical Design
#### Collection Lifecycle
Collections are explicitly created in the librarian before data operations can proceed:
1. **Collection Creation** (Two Paths):
**Path A: User-Initiated Creation** via `tg-set-collection`:
- User provides collection ID, name, description, and tags
- Librarian creates metadata record in `collections` table
- Librarian broadcasts "create-collection" to all storage backends
- All storage processors create collection and confirm success
- Collection is now ready for data operations
**Path B: Automatic Creation on Document Submission**:
- User submits document specifying a collection ID
- Librarian checks if collection exists in metadata table
- If not exists: Librarian creates metadata with defaults (name=collection_id, empty description/tags)
- Librarian broadcasts "create-collection" to all storage backends
- All storage processors create collection and confirm success
- Document processing proceeds with collection now established
Both paths ensure the collection exists in librarian metadata AND all storage backends before data operations proceed.
2. **Storage Validation**: Write operations validate collection exists:
- Storage processors check collection state before accepting writes
- Writes to non-existent collections return error
- This prevents direct writes bypassing the librarian's collection creation logic
3. **Query Behavior**: Query operations handle non-existent collections gracefully:
- Queries to non-existent collections return empty results
- No error thrown for query operations
- Allows exploration without requiring collection to exist
4. **Metadata Updates**: Users can update collection metadata after creation:
- Update name, description, and tags via `tg-set-collection`
- Updates apply to librarian metadata only
- Storage backends maintain collection but metadata updates don't propagate
5. **Explicit Deletion**: Users delete collections via `tg-delete-collection`:
- Librarian broadcasts "delete-collection" to all storage backends
- Waits for confirmation from all storage processors
- Deletes librarian metadata record only after storage cleanup complete
- Ensures no orphaned data remains in storage
**Key Principle**: The librarian is the single point of control for collection creation. Whether initiated by user command or document submission, the librarian ensures proper metadata tracking and storage backend synchronization before allowing data operations.
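The two creation paths can be sketched as a single coordination routine (illustrative only; `ensure_collection` and the backend interface are hypothetical names, and the real librarian communicates with storage backends via message queues rather than direct calls):

```python
# Sketch of the librarian's create-collection coordination.
# Path A passes explicit metadata; Path B relies on the defaults below.
def ensure_collection(metadata_store, backends, collection_id,
                      name=None, description="", tags=None):
    if collection_id not in metadata_store:
        # Create the metadata record (Path B defaults: name = collection ID)
        metadata_store[collection_id] = {
            "name": name or collection_id,
            "description": description,
            "tags": set(tags or ()),
        }
        # Broadcast create-collection; every backend must confirm success
        for backend in backends:
            if not backend.create_collection(collection_id):
                raise RuntimeError(f"backend failed to create {collection_id}")
    return metadata_store[collection_id]
```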
Operations required:
- **Create Collection**: User operation via `tg-set-collection` OR automatic on document submission
- **Update Collection Metadata**: User operation to modify name, description, and tags
- **Delete Collection**: User operation to remove collection and its data across all stores
- **List Collections**: User operation to view collections with filtering by tags
#### Multi-Store Collection Management
Collections exist across multiple storage backends in TrustGraph:
- **Vector Stores** (Qdrant, Milvus, Pinecone): Store embeddings and vector data
- **Object Stores** (Cassandra): Store documents and file data
- **Triple Stores** (Cassandra, Neo4j, Memgraph, FalkorDB): Store graph/RDF data
Each store type implements:
- **Collection Creation**: Accept and process "create-collection" operations
- **Collection Validation**: Check collection exists before accepting writes
- **Collection Deletion**: Remove all data for specified collection
The librarian service coordinates collection operations across all store types, ensuring:
- Collections created in all backends before use
- All backends confirm creation before returning success
- Synchronized collection lifecycle across storage types
- Consistent error handling when collections don't exist
#### Collection State Tracking by Storage Type
Each storage backend tracks collection state differently based on its capabilities:
**Cassandra Triple Store:**
- Uses existing `triples_collection` table
- Creates system marker triple when collection created
- Query: `SELECT collection FROM triples_collection WHERE collection = ? LIMIT 1`
- Efficient single-partition check for collection existence
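The existence check above might be wrapped as follows (a sketch; shown with the `%s` placeholder form used by non-prepared statements in the Python driver):

```python
# Sketch of the single-partition existence check described above.
# %s is the placeholder form for non-prepared statements in the Python driver.
EXISTS_CQL = "SELECT collection FROM triples_collection WHERE collection = %s LIMIT 1"

def collection_exists(session, collection):
    rows = session.execute(EXISTS_CQL, (collection,))
    # Any returned row means at least one marker/data triple exists
    return next(iter(rows), None) is not None
```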
**Qdrant/Milvus/Pinecone Vector Stores:**
- Native collection APIs provide existence checking
- Collections created with proper vector configuration
- `collection_exists()` method uses storage API
- Collection creation validates dimension requirements
**Neo4j/Memgraph/FalkorDB Graph Stores:**
- Use `:CollectionMetadata` nodes to track collections
- Node properties: `{user, collection, created_at}`
- Query: `MATCH (c:CollectionMetadata {user: $user, collection: $collection})`
- Separate from data nodes for clean separation
- Enables efficient collection listing and validation
**Cassandra Object Store:**
- Uses collection metadata table or marker rows
- Similar pattern to triple store
- Validates collection before document writes
### APIs
Collection Management APIs (Librarian):
- **Create/Update Collection**: Create new collection or update existing metadata via `tg-set-collection`
- **List Collections**: Retrieve collections for a user with optional tag filtering
- **Update Collection Metadata**: Modify collection name, description, and tags
- **Delete Collection**: Remove collection and associated data, cascading to all store types
Storage Management APIs (All Storage Processors):
- **Create Collection**: Handle "create-collection" operation, establish collection in storage
- **Delete Collection**: Handle "delete-collection" operation, remove all collection data
- **Collection Exists Check**: Internal validation before accepting write operations
Data Operation APIs (Modified Behavior):
- **Write APIs**: Validate collection exists before accepting data, return error if not
- **Query APIs**: Return empty results for non-existent collections without error
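The contrasting write/query behavior can be sketched with an in-memory stand-in for a storage processor (illustrative; real processors validate against their own collection state tracking):

```python
# Sketch of the modified data-operation behavior: writes are validated,
# queries degrade gracefully. The class and its fields are assumptions.
class StorageProcessor:
    def __init__(self):
        self.collections = set()   # populated by create-collection operations
        self.data = {}

    def store(self, collection, item):
        # Write APIs: reject writes to non-existent collections
        if collection not in self.collections:
            raise ValueError(f"collection does not exist: {collection}")
        self.data.setdefault(collection, []).append(item)

    def query(self, collection):
        # Query APIs: empty results, no error, for non-existent collections
        return self.data.get(collection, [])
```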
### Implementation Details
#### Collection Management Interface
All store writers implement a standardized collection management interface with a common schema:
**Message Schema (`StorageManagementRequest`):**
```json
{
"operation": "delete-collection",
"operation": "create-collection" | "delete-collection",
"user": "user123",
"collection": "documents-2024",
"timestamp": "2024-01-15T10:30:00Z"
"collection": "documents-2024"
}
```
**Queue Architecture:**
- **Vector Store Management Queue** (`vector-storage-management`): Vector/embedding stores
- **Object Store Management Queue** (`object-storage-management`): Object/document stores
- **Triple Store Management Queue** (`triples-storage-management`): Graph/RDF stores
- **Storage Response Queue** (`storage-management-response`): All responses sent here
Each store writer implements:
- **Collection Management Handler**: Processes `StorageManagementRequest` messages
- **Create Collection Operation**: Establishes collection in storage backend
- **Delete Collection Operation**: Removes all data associated with collection
- **Collection State Tracking**: Maintains knowledge of which collections exist
- **Message Processing**: Consumes from dedicated management queue
- **Status Reporting**: Returns success/failure via `StorageManagementResponse`
- **Idempotent Operations**: Safe to call create/delete multiple times
**Supported Operations:**
- `create-collection`: Create collection in storage backend
- `delete-collection`: Remove all collection data from storage backend
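A possible shape for a store writer's handler of these two operations (the field names follow the `StorageManagementRequest` schema above; the dict-backed store and the function name are assumptions):

```python
# Sketch of a store writer's management handler. Operations are idempotent:
# creating an existing collection or deleting a missing one is a no-op.
def handle_management_request(store, request):
    op = request["operation"]
    key = (request["user"], request["collection"])
    if op == "create-collection":
        store.setdefault(key, [])          # idempotent create
        return {"status": "success"}
    if op == "delete-collection":
        store.pop(key, None)               # idempotent delete
        return {"status": "success"}
    return {"status": "error", "error": f"unknown operation: {op}"}
```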
#### Cassandra Triple Store Refactor
As part of this implementation, the Cassandra triple store will be refactored from a table-per-collection design to a unified multi-table schema:
- Maintain same query logic with collection parameter
**Benefits:**
- **Simplified Collection Deletion**: Enumerate triples via the `collection`-partitioned `triples_collection` table, then batch-delete from all 4 tables
- **Resource Efficiency**: Fewer database connections and table objects
- **Cross-Collection Operations**: Easier to implement operations spanning multiple collections
- **Consistent Architecture**: Aligns with unified collection metadata approach
- **Collection Validation**: Easy to check collection existence via `triples_collection` table
Collection operations will be atomic where possible and provide appropriate error handling and validation.
Collection listing operations may need pagination for environments with large numbers of collections.
## Testing Strategy
Comprehensive testing will cover:
- Collection creation workflow end-to-end
- Storage backend synchronization
- Write validation for non-existent collections
- Query handling of non-existent collections
- Collection deletion cascade across all stores
- Error handling and recovery scenarios
- Unit tests for each storage backend
- Integration tests for cross-store operations
## Implementation Status
### ✅ Completed Components
1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_manager.py`)
- Collection metadata CRUD operations (list, update, delete)
- Cassandra collection metadata table integration via `LibraryTableStore`
- Async request/response handling with proper error management
- Collection deletion cascade coordination across all storage types
2. **Collection Metadata Schema** (`trustgraph-base/trustgraph/schema/services/collection.py`)
- `CollectionManagementRequest` and `CollectionManagementResponse` schemas
3. **Storage Management Schema** (`trustgraph-base/trustgraph/schema/services/storage.py`)
- `StorageManagementRequest` and `StorageManagementResponse` schemas
- Storage management queue topics defined
- Message format for storage-level collection operations
4. **Cassandra 4-Table Schema** (`trustgraph-flow/trustgraph/direct/cassandra_kg.py`)
- Compound partition keys for query performance
- `triples_collection` table for SPO queries and deletion tracking
- Collection deletion implemented with read-then-delete pattern
### 🔄 In Progress Components
1. **Collection Creation Broadcast** (`trustgraph-flow/trustgraph/librarian/collection_manager.py`)
- Update `update_collection()` to send "create-collection" to storage backends
- Wait for confirmations from all storage processors
- Handle creation failures appropriately
2. **Document Submission Handler** (`trustgraph-flow/trustgraph/librarian/service.py` or similar)
- Check if collection exists when document submitted
- If not exists: Create collection with defaults before processing document
- Trigger same "create-collection" broadcast as `tg-set-collection`
- Ensure collection established before document flows to storage processors
### ❌ Pending Components
1. **Collection State Tracking** - Need to implement in each storage backend:
- **Cassandra Triples**: Use `triples_collection` table with marker triples
- **Neo4j/Memgraph/FalkorDB**: Create `:CollectionMetadata` nodes
- **Qdrant/Milvus/Pinecone**: Use native collection APIs
- **Cassandra Objects**: Add collection metadata tracking
2. **Storage Management Handlers** - Need "create-collection" support in 12 files:
- `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/triples/neo4j/write.py`
- `trustgraph-flow/trustgraph/storage/triples/memgraph/write.py`
- `trustgraph-flow/trustgraph/storage/triples/falkordb/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/pinecone/write.py`
- `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py`
- Plus any other storage implementations
3. **Write Operation Validation** - Add collection existence checks to all `store_*` methods
4. **Query Operation Handling** - Update queries to return empty for non-existent collections
### Next Implementation Steps
1. **Define Storage Management Topics** in `trustgraph-base/trustgraph/schema/services/storage.py`
2. **Implement Collection Management Handlers** in each storage writer:
- Add `StorageManagementRequest` consumers
- Implement collection deletion operations
- Add response producers for status reporting
3. **Test End-to-End Collection Deletion** across all storage types
**Phase 1: Core Infrastructure (2-3 days)**
1. Add collection state tracking methods to all storage backends
2. Implement `collection_exists()` and `create_collection()` methods
## Timeline
**Phase 2: Storage Handlers (1 week)**
3. Add "create-collection" handlers to all storage processors
4. Add write validation to reject non-existent collections
5. Update query handling for non-existent collections
Phase 1 (Storage Topics): 1-2 days
Phase 2 (Store Handlers): 1-2 weeks depending on number of storage backends
Phase 3 (Testing & Integration): 3-5 days
**Phase 3: Collection Manager (2-3 days)**
6. Update collection_manager to broadcast creates
7. Implement response tracking and error handling
## Open Questions
- Should collection deletion be soft or hard delete by default?
- What metadata fields should be required vs optional?
- Should we implement storage management handlers incrementally by store type?
**Phase 4: Testing (3-5 days)**
8. End-to-end testing of explicit creation workflow
9. Test all storage backends
10. Validate error handling and edge cases


@ -6,7 +6,7 @@ A flow class defines a complete dataflow pattern template in the TrustGraph syst
## Structure
A flow class definition consists of five main sections:
### 1. Class Section
Defines shared service processors that are instantiated once per flow class. These processors handle requests from all flow instances of this class.
@ -15,7 +15,11 @@ Defines shared service processors that are instantiated once per flow class. The
```json
"class": {
    "service-name:{class}": {
        "request": "queue-pattern:{class}",
        "response": "queue-pattern:{class}",
        "settings": {
            "setting-name": "fixed-value",
            "parameterized-setting": "{parameter-name}"
        }
    }
}
```
@ -24,6 +28,7 @@ Defines shared service processors that are instantiated once per flow class. The
- Shared across all flow instances of the same class
- Typically expensive or stateless services (LLMs, embedding models)
- Use `{class}` template variable for queue naming
- Settings can be fixed values or parameterized with `{parameter-name}` syntax
- Examples: `embeddings:{class}`, `text-completion:{class}`, `graph-rag:{class}`
### 2. Flow Section
@ -33,7 +38,11 @@ Defines flow-specific processors that are instantiated for each individual flow
```json
"flow": {
    "processor-name:{id}": {
        "input": "queue-pattern:{id}",
        "output": "queue-pattern:{id}",
        "settings": {
            "setting-name": "fixed-value",
            "parameterized-setting": "{parameter-name}"
        }
    }
}
```
@ -42,6 +51,7 @@ Defines flow-specific processors that are instantiated for each individual flow
- Unique instance per flow
- Handle flow-specific data and state
- Use `{id}` template variable for queue naming
- Settings can be fixed values or parameterized with `{parameter-name}` syntax
- Examples: `chunker:{id}`, `pdf-decoder:{id}`, `kg-extract-relationships:{id}`
### 3. Interfaces Section
@ -72,7 +82,24 @@ Interfaces can take two forms:
- **Service Interfaces**: Request/response patterns for services (`embeddings`, `text-completion`)
- **Data Interfaces**: Fire-and-forget data flow connection points (`triples-store`, `entity-contexts-load`)
### 4. Parameters Section
Maps flow-specific parameter names to centrally-stored parameter definitions:
```json
"parameters": {
    "model": "llm-model",
    "temp": "temperature",
    "chunk": "chunk-size"
}
```
**Characteristics:**
- Keys are parameter names used in processor settings (e.g., `{model}`)
- Values reference parameter definitions stored in schema/config
- Enables reuse of common parameter definitions across flows
- Reduces duplication of parameter schemas
### 5. Metadata
Additional information about the flow class:
@ -82,16 +109,98 @@ Additional information about the flow class:
## Template Variables
### System Variables

#### {id}
- Replaced with the unique flow instance identifier
- Creates isolated resources for each flow
- Example: `flow-123`, `customer-A-flow`
#### {class}
- Replaced with the flow class name
- Creates shared resources across flows of the same class
- Example: `standard-rag`, `enterprise-rag`
### Parameter Variables
#### {parameter-name}
- Custom parameters defined at flow launch time
- Parameter names match keys in the flow's `parameters` section
- Used in processor settings to customize behavior
- Examples: `{model}`, `{temp}`, `{chunk}`
- Replaced with values provided when launching the flow
- Validated against centrally-stored parameter definitions
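The substitution described above can be sketched with a small helper. This is illustrative code (the `expand_template` helper is hypothetical, not the actual TrustGraph implementation):

```python
import re

def expand_template(text, flow_id, flow_class, params):
    """Expand {id}, {class}, and user-parameter placeholders in one pass."""
    values = {"id": flow_id, "class": flow_class, **params}

    def replace(match):
        name = match.group(1)
        if name not in values:
            # Undefined references are a launch-time error, not a silent no-op
            raise ValueError(f"undefined template variable: {{{name}}}")
        return str(values[name])

    return re.sub(r"\{([A-Za-z0-9_-]+)\}", replace, text)

queue = expand_template(
    "persistent://tg/flow/chunk-load:{id}",
    "customer-A-flow", "standard-rag", {})
# queue == "persistent://tg/flow/chunk-load:customer-A-flow"
```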
## Processor Settings
Settings provide configuration values to processors at instantiation time. They can be:
### Fixed Settings
Direct values that don't change:
```json
"settings": {
    "model": "gemma3:12b",
    "temperature": 0.7,
    "max_retries": 3
}
```
### Parameterized Settings
Values that use parameters provided at flow launch:
```json
"settings": {
    "model": "{model}",
    "temperature": "{temp}",
    "endpoint": "https://{region}.api.example.com"
}
```
Parameter names in settings correspond to keys in the flow's `parameters` section.
### Settings Examples
**LLM Processor with Parameters:**
```json
// In parameters section:
"parameters": {
    "model": "llm-model",
    "temp": "temperature",
    "tokens": "max-tokens",
    "key": "openai-api-key"
}

// In processor definition:
"text-completion:{class}": {
    "request": "non-persistent://tg/request/text-completion:{class}",
    "response": "non-persistent://tg/response/text-completion:{class}",
    "settings": {
        "model": "{model}",
        "temperature": "{temp}",
        "max_tokens": "{tokens}",
        "api_key": "{key}"
    }
}
```
**Chunker with Fixed and Parameterized Settings:**
```json
// In parameters section:
"parameters": {
    "chunk": "chunk-size"
}

// In processor definition:
"chunker:{id}": {
    "input": "persistent://tg/flow/chunk:{id}",
    "output": "persistent://tg/flow/chunk-load:{id}",
    "settings": {
        "chunk_size": "{chunk}",
        "chunk_overlap": 100,
        "encoding": "utf-8"
    }
}
```
## Queue Patterns (Pulsar)
Flow classes use Apache Pulsar for messaging. Queue names follow the Pulsar format:
@ -137,15 +246,27 @@ All processors (both `{id}` and `{class}`) work together as a cohesive dataflow
Given:
- Flow Instance ID: `customer-A-flow`
- Flow Class: `standard-rag`
- Flow parameter mappings:
- `"model": "llm-model"`
- `"temp": "temperature"`
- `"chunk": "chunk-size"`
- User-provided parameters:
- `model`: `gpt-4`
- `temp`: `0.5`
- `chunk`: `512`
Template expansions:
- `persistent://tg/flow/chunk-load:{id}` → `persistent://tg/flow/chunk-load:customer-A-flow`
- `non-persistent://tg/request/embeddings:{class}` → `non-persistent://tg/request/embeddings:standard-rag`
- `"model": "{model}"` → `"model": "gpt-4"`
- `"temperature": "{temp}"` → `"temperature": "0.5"`
- `"chunk_size": "{chunk}"` → `"chunk_size": "512"`
This creates:
- Isolated document processing pipeline for `customer-A-flow`
- Shared embedding service for all `standard-rag` flows
- Complete dataflow from document ingestion through querying
- Processors configured with the provided parameter values
## Benefits


@ -0,0 +1,485 @@
# Flow Class Configurable Parameters Technical Specification
## Overview
This specification describes the implementation of configurable parameters for flow classes in TrustGraph. Parameters enable users to customize processor parameters at flow launch time by providing values that replace parameter placeholders in the flow class definition.
Parameters work through template variable substitution in processor parameters, similar to how `{id}` and `{class}` variables work, but with user-provided values.
The integration supports four primary use cases:
1. **Model Selection**: Allowing users to choose different LLM models (e.g., `gemma3:8b`, `gpt-4`, `claude-3`) for processors
2. **Resource Configuration**: Adjusting processor parameters like chunk sizes, batch sizes, and concurrency limits
3. **Behavioral Tuning**: Modifying processor behavior through parameters like temperature, max-tokens, or retrieval thresholds
4. **Environment-Specific Parameters**: Configuring endpoints, API keys, or region-specific URLs per deployment
## Goals
- **Dynamic Processor Configuration**: Enable runtime configuration of processor parameters through parameter substitution
- **Parameter Validation**: Provide type checking and validation for parameters at flow launch time
- **Default Values**: Support sensible defaults while allowing overrides for advanced users
- **Template Substitution**: Seamlessly replace parameter placeholders in processor parameters
- **UI Integration**: Enable parameter input through both API and UI interfaces
- **Type Safety**: Ensure parameter types match expected processor parameter types
- **Documentation**: Self-documenting parameter schemas within flow class definitions
- **Backward Compatibility**: Maintain compatibility with existing flow classes that don't use parameters
## Background
Flow classes in TrustGraph now support processor parameters that can contain either fixed values or parameter placeholders. This creates an opportunity for runtime customization.
Current processor parameters support:
- Fixed values: `"model": "gemma3:12b"`
- Parameter placeholders: `"model": "gemma3:{model-size}"`
This specification defines how parameters are:
- Declared in flow class definitions
- Validated when flows are launched
- Substituted in processor parameters
- Exposed through APIs and UI
By leveraging parameterized processor parameters, TrustGraph can:
- Reduce flow class duplication by using parameters for variations
- Enable users to tune processor behavior without modifying definitions
- Support environment-specific configurations through parameter values
- Maintain type safety through parameter schema validation
## Technical Design
### Architecture
The configurable parameters system requires the following technical components:
1. **Parameter Schema Definition**
- JSON Schema-based parameter definitions within flow class metadata
- Type definitions including string, number, boolean, enum, and object types
- Validation rules including min/max values, patterns, and required fields
Module: trustgraph-flow/trustgraph/flow/definition.py
2. **Parameter Resolution Engine**
- Runtime parameter validation against schema
- Default value application for unspecified parameters
- Parameter injection into flow execution context
- Type coercion and conversion as needed
Module: trustgraph-flow/trustgraph/flow/parameter_resolver.py
3. **Parameter Store Integration**
- Retrieval of parameter definitions from schema/config store
- Caching of frequently-used parameter definitions
- Validation against centrally-stored schemas
Module: trustgraph-flow/trustgraph/flow/parameter_store.py
4. **Flow Launcher Extensions**
- API extensions to accept parameter values during flow launch
- Parameter mapping resolution (flow names to definition names)
- Error handling for invalid parameter combinations
Module: trustgraph-flow/trustgraph/flow/launcher.py
5. **UI Parameter Forms**
- Dynamic form generation from flow parameter metadata
- Ordered parameter display using `order` field
- Descriptive parameter labels using `description` field
- Input validation against parameter type definitions
- Parameter presets and templates
Module: trustgraph-ui/components/flow-parameters/
### Data Models
#### Parameter Definitions (Stored in Schema/Config)
Parameter definitions are stored centrally in the schema and config system with type "parameter-types":
```json
{
    "llm-model": {
        "type": "string",
        "description": "LLM model to use",
        "default": "gpt-4",
        "enum": [
            {
                "id": "gpt-4",
                "description": "OpenAI GPT-4 (Most Capable)"
            },
            {
                "id": "gpt-3.5-turbo",
                "description": "OpenAI GPT-3.5 Turbo (Fast & Efficient)"
            },
            {
                "id": "claude-3",
                "description": "Anthropic Claude 3 (Thoughtful & Safe)"
            },
            {
                "id": "gemma3:8b",
                "description": "Google Gemma 3 8B (Open Source)"
            }
        ],
        "required": false
    },
    "model-size": {
        "type": "string",
        "description": "Model size variant",
        "default": "8b",
        "enum": ["2b", "8b", "12b", "70b"],
        "required": false
    },
    "temperature": {
        "type": "number",
        "description": "Model temperature for generation",
        "default": 0.7,
        "minimum": 0.0,
        "maximum": 2.0,
        "required": false
    },
    "chunk-size": {
        "type": "integer",
        "description": "Document chunk size",
        "default": 512,
        "minimum": 128,
        "maximum": 2048,
        "required": false
    }
}
```
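A value can be checked against one of these definitions with a few lines of validation logic. This is an illustrative sketch (the `validate_parameter` helper is hypothetical); note that `enum` entries may be bare strings or objects carrying an `id`:

```python
def validate_parameter(defn, value):
    """Check a user-supplied value against a 'parameter-types' definition."""
    t = defn.get("type", "string")
    if t == "integer" and not isinstance(value, int):
        raise TypeError("expected integer")
    if t == "number" and not isinstance(value, (int, float)):
        raise TypeError("expected number")
    if t == "string" and not isinstance(value, str):
        raise TypeError("expected string")
    if "minimum" in defn and value < defn["minimum"]:
        raise ValueError(f"below minimum {defn['minimum']}")
    if "maximum" in defn and value > defn["maximum"]:
        raise ValueError(f"above maximum {defn['maximum']}")
    if "enum" in defn:
        # Enum entries are bare strings or {"id": ..., "description": ...}
        allowed = [e["id"] if isinstance(e, dict) else e for e in defn["enum"]]
        if value not in allowed:
            raise ValueError(f"not one of {allowed}")
    return value

temp_defn = {"type": "number", "default": 0.7, "minimum": 0.0, "maximum": 2.0}
validate_parameter(temp_defn, 0.5)   # passes; 3.0 would raise ValueError
```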
#### Flow Class with Parameter References
Flow classes define parameter metadata with type references, descriptions, and ordering:
```json
{
    "flow_class": "document-analysis",
    "parameters": {
        "llm-model": {
            "type": "llm-model",
            "description": "Primary LLM model for text completion",
            "order": 1
        },
        "llm-rag-model": {
            "type": "llm-model",
            "description": "LLM model for RAG operations",
            "order": 2,
            "advanced": true,
            "controlled-by": "llm-model"
        },
        "llm-temperature": {
            "type": "temperature",
            "description": "Generation temperature for creativity control",
            "order": 3,
            "advanced": true
        },
        "chunk-size": {
            "type": "chunk-size",
            "description": "Document chunk size for processing",
            "order": 4,
            "advanced": true
        },
        "chunk-overlap": {
            "type": "integer",
            "description": "Overlap between document chunks",
            "order": 5,
            "advanced": true,
            "controlled-by": "chunk-size"
        }
    },
    "class": {
        "text-completion:{class}": {
            "request": "non-persistent://tg/request/text-completion:{class}",
            "response": "non-persistent://tg/response/text-completion:{class}",
            "parameters": {
                "model": "{llm-model}",
                "temperature": "{llm-temperature}"
            }
        },
        "rag-completion:{class}": {
            "request": "non-persistent://tg/request/rag-completion:{class}",
            "response": "non-persistent://tg/response/rag-completion:{class}",
            "parameters": {
                "model": "{llm-rag-model}",
                "temperature": "{llm-temperature}"
            }
        }
    },
    "flow": {
        "chunker:{id}": {
            "input": "persistent://tg/flow/chunk:{id}",
            "output": "persistent://tg/flow/chunk-load:{id}",
            "parameters": {
                "chunk_size": "{chunk-size}",
                "chunk_overlap": "{chunk-overlap}"
            }
        }
    }
}
```
The `parameters` section maps flow-specific parameter names (keys) to parameter metadata objects containing:
- `type`: Reference to centrally-defined parameter definition (e.g., "llm-model")
- `description`: Human-readable description for UI display
- `order`: Display order for parameter forms (lower numbers appear first)
- `advanced` (optional): Boolean flag indicating if this is an advanced parameter (default: false). When set to true, the UI may hide this parameter by default or place it in an "Advanced" section
- `controlled-by` (optional): Name of another parameter that controls this parameter's value when in simple mode. When specified, this parameter inherits its value from the controlling parameter unless explicitly overridden
This approach allows:
- Reusable parameter type definitions across multiple flow classes
- Centralized parameter type management and validation
- Flow-specific parameter descriptions and ordering
- Enhanced UI experience with descriptive parameter forms
- Consistent parameter validation across flows
- Easy addition of new standard parameter types
- Simplified UI with basic/advanced mode separation
- Parameter value inheritance for related settings
#### Flow Launch Request
The flow launch API accepts parameters using the flow's parameter names:
```json
{
    "flow_class": "document-analysis",
    "flow_id": "customer-A-flow",
    "parameters": {
        "llm-model": "claude-3",
        "llm-temperature": 0.5,
        "chunk-size": 1024
    }
}
```
Note: In this example, `llm-rag-model` is not explicitly provided but will inherit the value "claude-3" from `llm-model` due to its `controlled-by` relationship. Similarly, `chunk-overlap` could inherit a calculated value based on `chunk-size`.
The system will:
1. Extract parameter metadata from flow class definition
2. Map flow parameter names to their type definitions (e.g., the flow's `llm-model` parameter → the `llm-model` type)
3. Resolve controlled-by relationships (e.g., `llm-rag-model` inherits from `llm-model`)
4. Validate user-provided and inherited values against the parameter type definitions
5. Substitute resolved values into processor parameters during flow instantiation
### Implementation Details
#### Parameter Resolution Process
When a flow is started, the system performs the following parameter resolution steps:
1. **Flow Class Loading**: Load flow class definition and extract parameter metadata
2. **Metadata Extraction**: Extract `type`, `description`, `order`, `advanced`, and `controlled-by` for each parameter defined in the flow class's `parameters` section
3. **Type Definition Lookup**: For each parameter in the flow class:
- Retrieve the parameter type definition from schema/config store using the `type` field
- The type definitions are stored with type "parameter-types" in the config system
- Each type definition contains the parameter's schema, default value, and validation rules
4. **Default Value Resolution**:
- For each parameter defined in the flow class:
- Check if the user provided a value for this parameter
- If no user value provided, use the `default` value from the parameter type definition
- Build a complete parameter map containing both user-provided and default values
5. **Parameter Inheritance Resolution** (controlled-by relationships):
- For parameters with `controlled-by` field, check if a value was explicitly provided
- If no explicit value provided, inherit the value from the controlling parameter
- If the controlling parameter also has no value, use the default from the type definition
- Validate that no circular dependencies exist in `controlled-by` relationships
6. **Validation**: Validate the complete parameter set (user-provided, defaults, and inherited) against type definitions
7. **Storage**: Store the complete resolved parameter set with the flow instance for auditability
8. **Template Substitution**: Replace parameter placeholders in processor parameters with resolved values
9. **Processor Instantiation**: Create processors with substituted parameters
**Important Implementation Notes:**
- The flow service MUST merge user-provided parameters with defaults from parameter type definitions
- The complete parameter set (including applied defaults) MUST be stored with the flow for traceability
- Parameter resolution happens at flow start time, not at processor instantiation time
- Missing required parameters without defaults MUST cause flow start to fail with a clear error message
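Steps 4 and 5 (default resolution and inheritance) might be sketched as follows. This is illustrative code with simplified error handling and hypothetical argument shapes, not the actual flow service implementation:

```python
def resolve_parameter_values(flow_params_meta, type_defs, user_params):
    """Merge user values, controlled-by inheritance, and type defaults.

    flow_params_meta: the flow class's "parameters" section
    type_defs: parameter type definitions keyed by type name
    user_params: values supplied at flow start time
    """
    resolved = {}
    # Pass 1: explicitly provided values always win
    for name in flow_params_meta:
        if name in user_params:
            resolved[name] = user_params[name]
    # Pass 2: inherit from controlling parameters, then fall back to defaults
    for name, meta in flow_params_meta.items():
        if name in resolved:
            continue
        controller = meta.get("controlled-by")
        if controller in resolved:
            resolved[name] = resolved[controller]
            continue
        defn = type_defs[meta["type"]]
        if "default" in defn:
            resolved[name] = defn["default"]
        elif defn.get("required"):
            raise ValueError(f"missing required parameter: {name}")
    return resolved
```

With the earlier example, providing only `llm-model: "claude-3"` yields `llm-rag-model: "claude-3"` by inheritance and `llm-temperature: 0.7` from the type default.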
#### Parameter Inheritance with controlled-by
The `controlled-by` field enables parameter value inheritance, particularly useful for simplifying user interfaces while maintaining flexibility:
**Example Scenario**:
- `llm-model` parameter controls the primary LLM model
- `llm-rag-model` parameter has `"controlled-by": "llm-model"`
- In simple mode, setting `llm-model` to "gpt-4" automatically sets `llm-rag-model` to "gpt-4" as well
- In advanced mode, users can override `llm-rag-model` with a different value
**Resolution Rules**:
1. If a parameter has an explicitly provided value, use that value
2. If no explicit value and `controlled-by` is set, use the controlling parameter's value
3. If the controlling parameter has no value, fall back to the default from the type definition
4. Circular dependencies in `controlled-by` relationships result in a validation error
**UI Behavior**:
- In basic/simple mode: Parameters with `controlled-by` may be hidden or shown as read-only with inherited value
- In advanced mode: All parameters are shown and can be individually configured
- When a controlling parameter changes, dependent parameters update automatically unless explicitly overridden
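Rule 4 (no circular `controlled-by` chains) can be enforced with a simple walk along each chain; a hypothetical validation sketch:

```python
def check_controlled_by_cycles(flow_params_meta):
    """Raise if any controlled-by chain loops back on itself (rule 4)."""
    for start, meta in flow_params_meta.items():
        seen = {start}
        current = meta.get("controlled-by")
        while current is not None:
            if current in seen:
                raise ValueError(
                    f"circular controlled-by relationship involving: {current}")
            seen.add(current)
            current = flow_params_meta.get(current, {}).get("controlled-by")
```

Running this once at flow-class validation time keeps the launch path simple.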
#### Pulsar Integration
1. **Start-Flow Operation**
- The Pulsar start-flow operation needs to accept a `parameters` field containing a map of parameter values
- The Pulsar schema for the start-flow request must be updated to include the optional `parameters` field
- Example request:
```json
{
    "flow_class": "document-analysis",
    "flow_id": "customer-A-flow",
    "parameters": {
        "model": "claude-3",
        "size": "12b",
        "temp": 0.5,
        "chunk": 1024
    }
}
```
2. **Get-Flow Operation**
- The Pulsar schema for the get-flow response must be updated to include the `parameters` field
- This allows clients to retrieve the parameter values that were used when the flow was started
- Example response:
```json
{
    "flow_id": "customer-A-flow",
    "flow_class": "document-analysis",
    "status": "running",
    "parameters": {
        "model": "claude-3",
        "size": "12b",
        "temp": 0.5,
        "chunk": 1024
    }
}
```
#### Flow Service Implementation
The flow configuration service (`trustgraph-flow/trustgraph/config/service/flow.py`) requires the following enhancements:
1. **Parameter Resolution Function**
```python
async def resolve_parameters(self, flow_class, user_params):
    """
    Resolve parameters by merging user-provided values with defaults.

    Args:
        flow_class: The flow class definition dict
        user_params: User-provided parameters dict

    Returns:
        Complete parameter dict with user values and defaults merged
    """
```
This function should:
- Extract parameter metadata from the flow class's `parameters` section
- For each parameter, fetch its type definition from config store
- Apply defaults for any parameters not provided by the user
- Handle `controlled-by` inheritance relationships
- Return the complete parameter set
2. **Modified `handle_start_flow` Method**
- Call `resolve_parameters` after loading the flow class
- Use the complete resolved parameter set for template substitution
- Store the complete parameter set (not just user-provided) with the flow
- Validate that all required parameters have values
3. **Parameter Type Fetching**
- Parameter type definitions are stored in config with type "parameter-types"
- Each type definition contains schema, default value, and validation rules
- Cache frequently-used parameter types to reduce config lookups
#### Config System Integration
3. **Flow Object Storage**
- When a flow is added to the config system by the flow component in the config manager, the flow object must include the resolved parameter values
- The config manager needs to store both the original user-provided parameters and the resolved values (with defaults applied)
- Flow objects in the config system should include:
- `parameters`: The final resolved parameter values used for the flow
#### CLI Integration
4. **Library CLI Commands**
- CLI commands that start flows need parameter support:
- Accept parameter values via command-line flags or configuration files
- Validate parameters against flow class definitions before submission
- Support parameter file input (JSON/YAML) for complex parameter sets
- CLI commands that show flows need to display parameter information:
- Show parameter values used when the flow was started
- Display available parameters for a flow class
- Show parameter validation schemas and defaults
#### Processor Base Class Integration
5. **ParametersSpec Support**
- Processor base classes need to support parameter substitution through the existing ParametersSpec mechanism
- The ParametersSpec class (located in the same module as ConsumerSpec and ProducerSpec) should be enhanced if necessary to support parameter template substitution
- Processors should be able to invoke ParametersSpec to configure their parameters with parameter values resolved at flow launch time
- The ParametersSpec implementation needs to:
- Accept parameters configurations that contain parameter placeholders (e.g., `{model}`, `{temperature}`)
- Support runtime parameter substitution when the processor is instantiated
- Validate that substituted values match expected types and constraints
- Provide error handling for missing or invalid parameter references
#### Substitution Rules
- Parameters use the format `{parameter-name}` in processor parameters
- Parameter names in parameters match the keys in the flow's `parameters` section
- Substitution occurs alongside `{id}` and `{class}` replacement
- Invalid parameter references result in launch-time errors
- Type validation happens based on the centrally-stored parameter definition
- **IMPORTANT**: All parameter values are stored and transmitted as strings
- Numbers are converted to strings (e.g., `0.7` becomes `"0.7"`)
- Booleans are converted to lowercase strings (e.g., `true` becomes `"true"`)
- This is required by the Pulsar schema which defines `parameters = Map(String())`
Example resolution:
```
Flow parameter mapping: "model": "llm-model"
Processor parameter: "model": "{model}"
User provides: "model": "gemma3:8b"
Final parameter: "model": "gemma3:8b"
Example with type conversion:
Parameter type default: 0.7 (number)
Stored in flow: "0.7" (string)
Substituted in processor: "0.7" (string)
```
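The string-encoding rule can be captured in one helper; a sketch (the `to_pulsar_string` name is hypothetical), including the bool-before-number check Python requires:

```python
def to_pulsar_string(value):
    """Encode a resolved parameter value for parameters = Map(String())."""
    if isinstance(value, bool):
        # bool is a subtype of int in Python, so check it first,
        # and use lowercase to match the spec ("true"/"false")
        return "true" if value else "false"
    return str(value)

assert to_pulsar_string(0.7) == "0.7"
assert to_pulsar_string(True) == "true"
assert to_pulsar_string("gpt-4") == "gpt-4"
```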
## Testing Strategy
- Unit tests for parameter schema validation
- Integration tests for parameter substitution in processor parameters
- End-to-end tests for launching flows with different parameter values
- UI tests for parameter form generation and validation
- Performance tests for flows with many parameters
- Edge cases: missing parameters, invalid types, undefined parameter references
## Migration Plan
1. The system should continue to support flow classes with no parameters declared.
2. The system should continue to support launching flows with no parameters specified: this works for flow classes with no parameters, and for flow classes with parameters (since their parameters have defaults).
## Open Questions
Q: Should parameters support complex nested objects or keep to simple types?
A: Parameter values are string-encoded, so we will probably want to stick to simple string types.
Q: Should parameter placeholders be allowed in queue names or only in
parameters?
A: Only in parameters to remove strange injections and edge-cases.
Q: How to handle conflicts between parameter names and system variables like
`id` and `class`?
A: It is not valid to specify `id` or `class` as parameter names when launching a flow.
Q: Should we support computed parameters (derived from other parameters)?
A: Just string substitution to remove strange injections and edge-cases.
## References
- JSON Schema Specification: https://json-schema.org/
- Flow Class Definition Spec: docs/tech-specs/flow-class-definition.md


@ -0,0 +1,629 @@
# GraphRAG Performance Optimisation Technical Specification
## Overview
This specification describes comprehensive performance optimisations for the GraphRAG (Graph Retrieval-Augmented Generation) algorithm in TrustGraph. The current implementation suffers from significant performance bottlenecks that limit scalability and response times. This specification addresses four primary optimisation areas:
1. **Graph Traversal Optimisation**: Eliminate inefficient recursive database queries and implement batched graph exploration
2. **Label Resolution Optimisation**: Replace sequential label fetching with parallel/batched operations
3. **Caching Strategy Enhancement**: Implement intelligent caching with LRU eviction and prefetching
4. **Query Optimisation**: Add result memoisation and embedding caching for improved response times
## Goals
- **Reduce Database Query Volume**: Achieve 50-80% reduction in total database queries through batching and caching
- **Improve Response Times**: Target 3-5x faster subgraph construction and 2-3x faster label resolution
- **Enhance Scalability**: Support larger knowledge graphs with better memory management
- **Maintain Accuracy**: Preserve existing GraphRAG functionality and result quality
- **Enable Concurrency**: Improve parallel processing capabilities for multiple concurrent requests
- **Reduce Memory Footprint**: Implement efficient data structures and memory management
- **Add Observability**: Include performance metrics and monitoring capabilities
- **Ensure Reliability**: Add proper error handling and timeout mechanisms
## Background
The current GraphRAG implementation in `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py` exhibits several critical performance issues that severely impact system scalability:
### Current Performance Problems
**1. Inefficient Graph Traversal (`follow_edges` function, lines 79-127)**
- Makes 3 separate database queries per entity per depth level
- Query pattern: subject-based, predicate-based, and object-based queries for each entity
- No batching: Each query processes only one entity at a time
- No cycle detection: Can revisit the same nodes multiple times
- Recursive implementation without memoisation leads to exponential complexity
- Time complexity: O(entities × max_path_length × triple_limit³)
**2. Sequential Label Resolution (`get_labelgraph` function, lines 144-171)**
- Processes each triple component (subject, predicate, object) sequentially
- Each `maybe_label` call potentially triggers a database query
- No parallel execution or batching of label queries
- Results in up to 3 × subgraph_size individual database calls
**3. Primitive Caching Strategy (`maybe_label` function, lines 62-77)**
- Simple dictionary cache without size limits or TTL
- No cache eviction policy leads to unbounded memory growth
- Cache misses trigger individual database queries
- No prefetching or intelligent cache warming
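A bounded replacement for the unbounded label dictionary is straightforward; a minimal LRU sketch (illustrative only — the proposed production cache would also want TTL and hit/miss metrics):

```python
from collections import OrderedDict

class LRUCache:
    """Size-bounded LRU cache to replace the unbounded label dict."""

    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.max_size:
            self.data.popitem(last=False)    # evict least recently used
```

Here `maybe_label` could consult `get` before querying the store and call `put` afterwards, keeping memory bounded.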
**4. Suboptimal Query Patterns**
- Entity vector similarity queries not cached between similar requests
- No result memoisation for repeated query patterns
- Missing query optimisation for common access patterns
**5. Critical Object Lifetime Issues (`rag.py:96-102`)**
- **GraphRag object recreated per request**: Fresh instance created for every query, losing all cache benefits
- **Query object extremely short-lived**: Created and destroyed within single query execution (lines 201-207)
- **Label cache reset per request**: Cache warming and accumulated knowledge lost between requests
- **Client recreation overhead**: Database clients potentially re-established for each request
- **No cross-request optimisation**: Cannot benefit from query patterns or result sharing
### Performance Impact Analysis
Current worst-case scenario for a typical query:
- **Entity Retrieval**: 1 vector similarity query
- **Graph Traversal**: entities × max_path_length × 3 × triple_limit queries
- **Label Resolution**: subgraph_size × 3 individual label queries
For default parameters (50 entities, path length 2, 30 triple limit, 150 subgraph size):
- **Minimum queries**: 1 + (50 × 2 × 3 × 30) + (150 × 3) = **9,451 database queries**
- **Response time**: 15-30 seconds for moderate-sized graphs
- **Memory usage**: Unbounded cache growth over time
- **Cache effectiveness**: 0% - caches reset on every request
- **Object creation overhead**: GraphRag + Query objects created/destroyed per request
This specification addresses these gaps by implementing batched queries, intelligent caching, and parallel processing. By optimizing query patterns and data access, TrustGraph can:
- Support enterprise-scale knowledge graphs with millions of entities
- Provide sub-second response times for typical queries
- Handle hundreds of concurrent GraphRAG requests
- Scale efficiently with graph size and complexity
## Technical Design
### Architecture
The GraphRAG performance optimisation requires the following technical components:
#### 1. **Object Lifetime Architectural Refactor**
- **Make GraphRag long-lived**: Move GraphRag instance to Processor level for persistence across requests
- **Preserve caches**: Maintain label cache, embedding cache, and query result cache between requests
- **Optimize Query object**: Refactor Query as lightweight execution context, not data container
- **Connection persistence**: Maintain database client connections across requests
Module: `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` (modified)
#### 2. **Optimized Graph Traversal Engine**
- Replace recursive `follow_edges` with iterative breadth-first search
- Implement batched entity processing at each traversal level
- Add cycle detection using visited node tracking
- Include early termination when limits are reached
Module: `trustgraph-flow/trustgraph/retrieval/graph_rag/optimized_traversal.py`
#### 3. **Parallel Label Resolution System**
- Batch label queries for multiple entities simultaneously
- Implement async/await patterns for concurrent database access
- Add intelligent prefetching for common label patterns
- Include label cache warming strategies
Module: `trustgraph-flow/trustgraph/retrieval/graph_rag/label_resolver.py`
#### 4. **Conservative Label Caching Layer**
- LRU cache with short TTL for labels only (5min) to balance performance vs consistency
- Cache metrics and hit ratio monitoring
- **No embedding caching**: Already cached per-query, no cross-query benefit
- **No query result caching**: Due to graph mutation consistency concerns
Module: `trustgraph-flow/trustgraph/retrieval/graph_rag/cache_manager.py`
#### 5. **Query Optimisation Framework**
- Query pattern analysis and optimisation suggestions
- Batch query coordinator for database access
- Connection pooling and query timeout management
- Performance monitoring and metrics collection
Module: `trustgraph-flow/trustgraph/retrieval/graph_rag/query_optimizer.py`
### Data Models
#### Optimized Graph Traversal State
The traversal engine maintains state to avoid redundant operations:
```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class TraversalState:
    visited_entities: Set[str]
    current_level_entities: Set[str]
    next_level_entities: Set[str]
    subgraph: Set[Tuple[str, str, str]]
    depth: int
    query_batch: List[TripleQuery]
```
This approach allows:
- Efficient cycle detection through visited entity tracking
- Batched query preparation at each traversal level
- Memory-efficient state management
- Early termination when size limits are reached
#### Enhanced Cache Structure
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CacheEntry:
    value: Any
    timestamp: float
    access_count: int
    ttl: Optional[float]

class CacheManager:
    # Only the label cache is active in the conservative default; the
    # embedding and query-result caches are reserved hooks, disabled
    # per the consistency analysis later in this spec.
    label_cache: LRUCache[str, CacheEntry]
    embedding_cache: LRUCache[str, CacheEntry]
    query_result_cache: LRUCache[str, CacheEntry]
    cache_stats: CacheStatistics
```
#### Batch Query Structures
```python
from dataclasses import dataclass
from typing import List

@dataclass
class BatchTripleQuery:
    entities: List[str]
    query_type: QueryType  # SUBJECT, PREDICATE, OBJECT
    limit_per_entity: int

@dataclass
class BatchLabelQuery:
    entities: List[str]
    predicate: str = LABEL
```
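`QueryType` is referenced above but never defined in this spec. A minimal sketch of the enum, plus a hypothetical `build_level_batches` helper showing how one traversal level's entities might be grouped into three batched queries (`BatchTripleQuery` is repeated here so the snippet stands alone):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, List

class QueryType(Enum):
    SUBJECT = auto()
    PREDICATE = auto()
    OBJECT = auto()

@dataclass
class BatchTripleQuery:
    entities: List[str]
    query_type: QueryType
    limit_per_entity: int

def build_level_batches(entities: Iterable[str], limit: int) -> List[BatchTripleQuery]:
    """Group one traversal level into three batched queries (one per
    triple position) instead of issuing 3 x len(entities) single calls."""
    ents = sorted(entities)
    return [BatchTripleQuery(ents, qt, limit) for qt in QueryType]
```

The batch coordinator can then translate each `BatchTripleQuery` into whatever multi-entity access the backing store supports.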
### APIs
#### New APIs:
**GraphTraversal API**
```python
async def optimized_follow_edges_batch(
entities: List[str],
max_depth: int,
triple_limit: int,
max_subgraph_size: int
) -> Set[Tuple[str, str, str]]
```
**Batch Label Resolution API**
```python
async def resolve_labels_batch(
entities: List[str],
cache_manager: CacheManager
) -> Dict[str, str]
```
**Cache Management API**
```python
class CacheManager:
    async def get_or_fetch_label(self, entity: str) -> str
    # The hooks below exist in the interface but are disabled in the
    # conservative default (no embedding or query-result caching):
    async def get_or_fetch_embeddings(self, query: str) -> List[float]
    async def cache_query_result(self, query_hash: str, result: Any, ttl: int)
    def get_cache_statistics(self) -> CacheStatistics
```
#### Modified APIs:
**GraphRag.query()** - Enhanced with performance optimisations:
- Add cache_manager parameter for cache control
- Include performance_metrics return value
- Add query_timeout parameter for reliability
**Query class** - Refactored for batch processing:
- Replace individual entity processing with batch operations
- Add async context managers for resource cleanup
- Include progress callbacks for long-running operations
### Implementation Details
#### Phase 0: Critical Architectural Lifetime Refactor
**Current Problematic Implementation:**
```python
# INEFFICIENT: GraphRag recreated every request
class Processor(FlowProcessor):
async def on_request(self, msg, consumer, flow):
# PROBLEM: New GraphRag instance per request!
self.rag = GraphRag(
embeddings_client = flow("embeddings-request"),
graph_embeddings_client = flow("graph-embeddings-request"),
triples_client = flow("triples-request"),
prompt_client = flow("prompt-request"),
verbose=True,
)
# Cache starts empty every time - no benefit from previous requests
response = await self.rag.query(...)
# VERY SHORT-LIVED: Query object created/destroyed per request
class GraphRag:
async def query(self, query, user="trustgraph", collection="default", ...):
q = Query(rag=self, user=user, collection=collection, ...) # Created
kg = await q.get_labelgraph(query) # Used briefly
# q automatically destroyed when function exits
```
**Optimized Long-Lived Architecture:**
```python
class Processor(FlowProcessor):
def __init__(self, **params):
super().__init__(**params)
self.rag_instance = None # Will be initialized once
self.client_connections = {}
async def initialize_rag(self, flow):
"""Initialize GraphRag once, reuse for all requests"""
if self.rag_instance is None:
self.rag_instance = LongLivedGraphRag(
embeddings_client=flow("embeddings-request"),
graph_embeddings_client=flow("graph-embeddings-request"),
triples_client=flow("triples-request"),
prompt_client=flow("prompt-request"),
verbose=True,
)
return self.rag_instance
async def on_request(self, msg, consumer, flow):
# REUSE the same GraphRag instance - caches persist!
rag = await self.initialize_rag(flow)
        # Query object becomes lightweight execution context
        # ('v' below is the decoded request message, elided here)
        response = await rag.query_with_context(
            query=v.query,
execution_context=QueryContext(
user=v.user,
collection=v.collection,
entity_limit=entity_limit,
# ... other params
)
)
class LongLivedGraphRag:
def __init__(self, ...):
# CONSERVATIVE caches - balance performance vs consistency
        self.label_cache = LRUCacheWithTTL(max_size=5000, default_ttl=300)  # 5min TTL for freshness
# Note: No embedding cache - already cached per-query, no cross-query benefit
# Note: No query result cache due to consistency concerns
self.performance_metrics = PerformanceTracker()
async def query_with_context(self, query: str, context: QueryContext):
# Use lightweight QueryExecutor instead of heavyweight Query object
executor = QueryExecutor(self, context) # Minimal object
return await executor.execute(query)
@dataclass
class QueryContext:
"""Lightweight execution context - no heavy operations"""
user: str
collection: str
entity_limit: int
triple_limit: int
max_subgraph_size: int
max_path_length: int
class QueryExecutor:
"""Lightweight execution context - replaces old Query class"""
def __init__(self, rag: LongLivedGraphRag, context: QueryContext):
self.rag = rag
self.context = context
# No heavy initialization - just references
async def execute(self, query: str):
# All heavy lifting uses persistent rag caches
return await self.rag.execute_optimized_query(query, self.context)
```
This architectural change provides:
- **10-20% database query reduction** for graphs with common relationships (vs 0% currently)
- **Eliminated object creation overhead** for every request
- **Persistent connection pooling** and client reuse
- **Cross-request optimization** within cache TTL windows
**Important Cache Consistency Limitation:**
Long-term caching introduces staleness risk when entities/labels are deleted or modified in the underlying graph. The LRU cache with TTL provides a balance between performance gains and data freshness, but cannot detect real-time graph changes.
#### Phase 1: Graph Traversal Optimisation
**Current Implementation Problems:**
```python
# INEFFICIENT: 3 queries per entity per level
async def follow_edges(self, ent, subgraph, path_length):
# Query 1: s=ent, p=None, o=None
res = await self.rag.triples_client.query(s=ent, p=None, o=None, limit=self.triple_limit)
# Query 2: s=None, p=ent, o=None
res = await self.rag.triples_client.query(s=None, p=ent, o=None, limit=self.triple_limit)
# Query 3: s=None, p=None, o=ent
res = await self.rag.triples_client.query(s=None, p=None, o=ent, limit=self.triple_limit)
```
**Optimized Implementation:**
```python
async def optimized_traversal(self, entities: List[str], max_depth: int) -> Set[Triple]:
visited = set()
current_level = set(entities)
subgraph = set()
for depth in range(max_depth):
if not current_level or len(subgraph) >= self.max_subgraph_size:
break
# Batch all queries for current level
batch_queries = []
for entity in current_level:
if entity not in visited:
batch_queries.extend([
TripleQuery(s=entity, p=None, o=None),
TripleQuery(s=None, p=entity, o=None),
TripleQuery(s=None, p=None, o=entity)
])
# Execute all queries concurrently
results = await self.execute_batch_queries(batch_queries)
# Process results and prepare next level
next_level = set()
for result in results:
subgraph.update(result.triples)
next_level.update(result.new_entities)
visited.update(current_level)
current_level = next_level - visited
return subgraph
```
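`execute_batch_queries` is referenced above but not defined. One plausible sketch uses `asyncio.gather` over the existing single-triple client interface (the `client.query(s, p, o, limit)` signature follows the current implementation; `TripleQuery` is repeated so the snippet stands alone, and the concurrency cap is a tuning assumption):

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TripleQuery:
    s: Optional[str] = None
    p: Optional[str] = None
    o: Optional[str] = None

async def execute_batch_queries(client, queries: List[TripleQuery],
                                limit: int, max_concurrency: int = 16):
    """Fire all per-entity queries for one traversal level concurrently,
    capped by a semaphore so a wide level cannot exhaust connections."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(q: TripleQuery):
        async with sem:
            return await client.query(s=q.s, p=q.p, o=q.o, limit=limit)

    return await asyncio.gather(*(run_one(q) for q in queries))
```

Even without server-side batching, this turns a level's serial round-trips into concurrent ones; a store with a native multi-get could replace the inner calls entirely.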
#### Phase 2: Parallel Label Resolution
**Current Sequential Implementation:**
```python
# INEFFICIENT: Sequential processing
for edge in subgraph:
s = await self.maybe_label(edge[0]) # Individual query
p = await self.maybe_label(edge[1]) # Individual query
o = await self.maybe_label(edge[2]) # Individual query
```
**Optimized Parallel Implementation:**
```python
async def resolve_labels_parallel(self, subgraph: List[Triple]) -> List[Triple]:
# Collect all unique entities needing labels
entities_to_resolve = set()
for s, p, o in subgraph:
entities_to_resolve.update([s, p, o])
# Remove already cached entities
uncached_entities = [e for e in entities_to_resolve if e not in self.label_cache]
# Batch query for all uncached labels
if uncached_entities:
label_results = await self.batch_label_query(uncached_entities)
self.label_cache.update(label_results)
# Apply labels to subgraph
return [
(self.label_cache.get(s, s), self.label_cache.get(p, p), self.label_cache.get(o, o))
for s, p, o in subgraph
]
```
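`batch_label_query` is likewise referenced but not shown. A sketch that fires the per-entity label lookups concurrently against the same triples client (the `LABEL` constant mirrors the one used by `BatchLabelQuery` above; its value here is an assumption, as is the client signature):

```python
import asyncio
from typing import Dict, Iterable

LABEL = "http://www.w3.org/2000/01/rdf-schema#label"  # assumed value

async def batch_label_query(client, entities: Iterable[str]) -> Dict[str, str]:
    """Resolve labels for many entities concurrently. Entities without a
    label triple are omitted, so callers fall back to the raw identifier."""
    async def one(entity: str):
        res = await client.query(s=entity, p=LABEL, o=None, limit=1)
        return entity, (res[0][2] if res else None)

    pairs = await asyncio.gather(*(one(e) for e in entities))
    return {e: label for e, label in pairs if label is not None}
```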
#### Phase 3: Advanced Caching Strategy
**LRU Cache with TTL:**
```python
import time
from collections import OrderedDict
from typing import Any, Optional

class LRUCacheWithTTL:
    def __init__(self, max_size: int, default_ttl: int = 300):
        self.cache = OrderedDict()
        self.max_size = max_size
        self.default_ttl = default_ttl
        self.insert_times = {}  # TTL is measured from last write, not last read

    async def get(self, key: str) -> Optional[Any]:
        if key in self.cache:
            # Check TTL expiration
            if time.time() - self.insert_times[key] > self.default_ttl:
                del self.cache[key]
                del self.insert_times[key]
                return None
            # Move to end (most recently used)
            self.cache.move_to_end(key)
            return self.cache[key]
        return None

    async def put(self, key: str, value: Any):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.max_size:
            # Remove least recently used
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
            del self.insert_times[oldest_key]
        self.cache[key] = value
        self.insert_times[key] = time.time()
```
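The `get_or_fetch_label` API listed earlier plugs into such a cache as follows. A compact stand-in cache is used so the snippet stands alone; the stand-in's names and TTL handling mirror the class above, and the `fetch` callable is a hypothetical label lookup:

```python
import asyncio
import time
from collections import OrderedDict
from typing import Any, Optional

class TTLCache:
    """Compact stand-in for LRUCacheWithTTL, enough to show the pattern."""
    def __init__(self, max_size: int, ttl: float):
        self.data: OrderedDict = OrderedDict()
        self.max_size, self.ttl = max_size, ttl

    def get(self, key: str) -> Optional[Any]:
        item = self.data.get(key)
        if item is None or time.time() - item[0] > self.ttl:
            self.data.pop(key, None)  # expired or absent
            return None
        self.data.move_to_end(key)  # mark most recently used
        return item[1]

    def put(self, key: str, value: Any) -> None:
        if key not in self.data and len(self.data) >= self.max_size:
            self.data.popitem(last=False)  # evict least recently used
        self.data[key] = (time.time(), value)
        self.data.move_to_end(key)

async def get_or_fetch_label(cache: TTLCache, entity: str, fetch) -> str:
    cached = cache.get(entity)
    if cached is not None:
        return cached
    label = await fetch(entity) or entity  # fall back to raw identifier
    cache.put(entity, label)
    return label
```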
#### Phase 4: Query Optimisation and Monitoring
**Performance Metrics Collection:**
```python
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
total_queries: int
cache_hits: int
cache_misses: int
avg_response_time: float
subgraph_construction_time: float
label_resolution_time: float
total_entities_processed: int
memory_usage_mb: float
```
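Derived values such as the hit ratio, and how `avg_response_time` stays a streaming mean, are not spelled out above. One way to compute them (a trimmed copy of the dataclass so the snippet stands alone):

```python
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    total_queries: int = 0
    cache_hits: int = 0
    cache_misses: int = 0
    avg_response_time: float = 0.0

    def record_response(self, elapsed: float) -> None:
        """Streaming mean: no per-request history is retained."""
        self.total_queries += 1
        self.avg_response_time += (elapsed - self.avg_response_time) / self.total_queries

    @property
    def hit_ratio(self) -> float:
        lookups = self.cache_hits + self.cache_misses
        return self.cache_hits / lookups if lookups else 0.0
```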
**Query Timeout and Circuit Breaker:**
```python
import asyncio

# logger and GraphRagTimeoutError are assumed module-level definitions
async def execute_with_timeout(self, query_func, timeout: int = 30):
    try:
        return await asyncio.wait_for(query_func(), timeout=timeout)
    except asyncio.TimeoutError:
        logger.error(f"Query timeout after {timeout}s")
        raise GraphRagTimeoutError(f"Query exceeded timeout of {timeout}s")
```
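The heading above names a circuit breaker, but only the timeout wrapper is shown. A minimal count-based breaker sketch that a query path could consult before calling the database (thresholds are tuning assumptions):

```python
import time
from typing import Optional

class CircuitBreaker:
    """Open after N consecutive failures; reject calls until a cooldown
    elapses, then permit a trial call (half-open)."""
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.time() - self.opened_at >= self.reset_timeout:
            return True  # half-open: allow a trial call
        return False     # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()
```

Combined with the timeout wrapper, a tripped breaker lets the retrieval service fail fast instead of stacking up 30-second timeouts against an unhealthy store.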
## Cache Consistency Considerations
**Data Staleness Trade-offs:**
- **Label cache (5min TTL)**: Risk of serving deleted/renamed entity labels
- **No embedding caching**: Not needed - embeddings already cached per-query
- **No result caching**: Prevents stale subgraph results from deleted entities/relationships
**Mitigation Strategies:**
- **Conservative TTL values**: Balance performance gains (10-20%) with data freshness
- **Cache invalidation hooks**: Optional integration with graph mutation events
- **Monitoring dashboards**: Track cache hit rates vs staleness incidents
- **Configurable cache policies**: Allow per-deployment tuning based on mutation frequency
**Recommended Cache Configuration by Graph Mutation Rate:**
- **High mutation (>100 changes/hour)**: TTL=60s, smaller cache sizes
- **Medium mutation (10-100 changes/hour)**: TTL=300s (default)
- **Low mutation (<10 changes/hour)**: TTL=600s, larger cache sizes
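These tiers can be captured in a small policy helper. TTL values are copied from the list above; the cache sizes and function name are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: int
    max_size: int

def policy_for_mutation_rate(changes_per_hour: float) -> CachePolicy:
    """Map an observed graph mutation rate to the tiers above."""
    if changes_per_hour > 100:
        return CachePolicy(ttl_seconds=60, max_size=1000)    # high mutation
    if changes_per_hour >= 10:
        return CachePolicy(ttl_seconds=300, max_size=5000)   # medium (default)
    return CachePolicy(ttl_seconds=600, max_size=20000)      # low mutation
```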
## Security Considerations
**Query Injection Prevention:**
- Validate all entity identifiers and query parameters
- Use parameterized queries for all database interactions
- Implement query complexity limits to prevent DoS attacks
**Resource Protection:**
- Enforce maximum subgraph size limits
- Implement query timeouts to prevent resource exhaustion
- Add memory usage monitoring and limits
**Access Control:**
- Maintain existing user and collection isolation
- Add audit logging for performance-impacting operations
- Implement rate limiting for expensive operations
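One lightweight way to realise that rate limit is a per-process concurrency cap on expensive operations that fails fast instead of queueing unboundedly (class name and limit value are assumptions):

```python
import asyncio

class ExpensiveOpLimiter:
    """Cap concurrent expensive GraphRAG operations; callers beyond
    the cap are rejected immediately rather than queued."""
    def __init__(self, max_concurrent: int = 8):
        self.sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_func, *args):
        if self.sem.locked():  # no permits available right now
            raise RuntimeError("rate limit exceeded: too many concurrent operations")
        async with self.sem:
            return await coro_func(*args)
```

A per-user or per-collection variant would hold one limiter per key, preserving the existing isolation model.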
## Performance Considerations
### Expected Performance Improvements
**Query Reduction:**
- Current: ~9,000+ queries for typical request
- Optimized: ~50-100 batched queries (98% reduction)
**Response Time Improvements:**
- Graph traversal: 15-20s → 3-5s (4-5x faster)
- Label resolution: 8-12s → 2-4s (3x faster)
- Overall query: 25-35s → 6-10s (3-4x improvement)
**Memory Efficiency:**
- Bounded cache sizes prevent memory leaks
- Efficient data structures reduce memory footprint by ~40%
- Better garbage collection through proper resource cleanup
**Realistic Performance Expectations:**
- **Label cache**: 10-20% query reduction for graphs with common relationships
- **Batching optimization**: 50-80% query reduction (primary optimization)
- **Object lifetime optimization**: Eliminate per-request creation overhead
- **Overall improvement**: 3-4x response time improvement primarily from batching
**Scalability Improvements:**
- Support for 3-5x larger knowledge graphs (limited by cache consistency needs)
- 3-5x higher concurrent request capacity
- Better resource utilization through connection reuse
### Performance Monitoring
**Real-time Metrics:**
- Query execution times by operation type
- Cache hit ratios and effectiveness
- Database connection pool utilisation
- Memory usage and garbage collection impact
**Performance Benchmarking:**
- Automated performance regression testing
- Load testing with realistic data volumes
- Comparison benchmarks against current implementation
## Testing Strategy
### Unit Testing
- Individual component testing for traversal, caching, and label resolution
- Mock database interactions for performance testing
- Cache eviction and TTL expiration testing
- Error handling and timeout scenarios
### Integration Testing
- End-to-end GraphRAG query testing with optimisations
- Database interaction testing with real data
- Concurrent request handling and resource management
- Memory leak detection and resource cleanup verification
### Performance Testing
- Benchmark testing against current implementation
- Load testing with varying graph sizes and complexities
- Stress testing for memory and connection limits
- Regression testing for performance improvements
### Compatibility Testing
- Verify existing GraphRAG API compatibility
- Test with various graph database backends
- Validate result accuracy compared to current implementation
## Implementation Plan
### Direct Implementation Approach
Since APIs are allowed to change, implement optimizations directly without migration complexity:
1. **Replace `follow_edges` method**: Rewrite with iterative batched traversal
2. **Optimize `get_labelgraph`**: Implement parallel label resolution
3. **Add long-lived GraphRag**: Modify Processor to maintain persistent instance
4. **Implement label caching**: Add LRU cache with TTL to GraphRag class
### Scope of Changes
- **Query class**: Replace ~50 lines in `follow_edges`, add ~30 lines batch handling
- **GraphRag class**: Add caching layer (~40 lines)
- **Processor class**: Modify to use persistent GraphRag instance (~20 lines)
- **Total**: ~140 lines of focused changes, mostly within existing classes
## Timeline
**Week 1: Core Implementation**
- Replace `follow_edges` with batched iterative traversal
- Implement parallel label resolution in `get_labelgraph`
- Add long-lived GraphRag instance to Processor
- Implement label caching layer
**Week 2: Testing and Integration**
- Unit tests for new traversal and caching logic
- Performance benchmarking against current implementation
- Integration testing with real graph data
- Code review and optimization
**Week 3: Deployment**
- Deploy optimized implementation
- Monitor performance improvements
- Fine-tune cache TTL and batch sizes based on real usage
## Open Questions
- **Database Connection Pooling**: Should we implement custom connection pooling or rely on existing database client pooling?
- **Cache Persistence**: Should the label cache persist across service restarts?
- **Distributed Caching**: For multi-instance deployments, should we implement distributed caching with Redis/Memcached?
- **Query Result Format**: Should we optimize the internal triple representation for better memory efficiency?
- **Monitoring Integration**: Which metrics should be exposed to existing monitoring systems (Prometheus, etc.)?
## References
- [GraphRAG Original Implementation](trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py)
- [TrustGraph Architecture Principles](architecture-principles.md)
- [Collection Management Specification](collection-management.md)