Collection delete pt. 3 (#542)

* Fixing collection deletion

* Fixing collection management param error

* Always test for collections

* Add Cassandra collection table

* Updated tech spec for explicit creation/deletion

* Remove implicit collection creation

* Fix up collection tracking in all processors
This commit is contained in:
cybermaggedon 2025-09-30 16:02:33 +01:00 committed by GitHub
parent dc79b10552
commit 52b133fc86
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
31 changed files with 1761 additions and 843 deletions

View file

@ -2,16 +2,17 @@
## Overview
This specification describes the collection management capabilities for TrustGraph, enabling users to have explicit control over collections that are currently implicitly created during data loading and querying operations. The feature supports four primary use cases:
This specification describes the collection management capabilities for TrustGraph, requiring explicit collection creation and providing direct control over the collection lifecycle. Collections must be explicitly created before use, ensuring proper synchronization between the librarian metadata and all storage backends. The feature supports four primary use cases:
1. **Collection Listing**: View all existing collections in the system
2. **Collection Deletion**: Remove unwanted collections and their associated data
3. **Collection Labeling**: Associate descriptive labels with collections for better organization
4. **Collection Tagging**: Apply tags to collections for categorization and easier discovery
1. **Collection Creation**: Explicitly create collections before storing data
2. **Collection Listing**: View all existing collections in the system
3. **Collection Metadata Management**: Update collection names, descriptions, and tags
4. **Collection Deletion**: Remove collections and their associated data across all storage types
## Goals
- **Explicit Collection Control**: Provide users with direct management capabilities over collections beyond implicit creation
- **Explicit Collection Creation**: Require collections to be created before data can be stored
- **Storage Synchronization**: Ensure collections exist in all storage backends (vectors, objects, triples)
- **Collection Visibility**: Enable users to list and inspect all collections in their environment
- **Collection Cleanup**: Allow deletion of collections that are no longer needed
- **Collection Organization**: Support labels and tags for better collection tracking and discovery
@ -19,22 +20,25 @@ This specification describes the collection management capabilities for TrustGra
- **Collection Discovery**: Make it easier to find specific collections through filtering and search
- **Operational Transparency**: Provide clear visibility into collection lifecycle and usage
- **Resource Management**: Enable cleanup of unused collections to optimize resource utilization
- **Data Integrity**: Prevent orphaned collections in storage without metadata tracking
## Background
Currently, collections in TrustGraph are implicitly created during data loading operations and query execution. While this provides convenience for users, it lacks the explicit control needed for production environments and long-term data management.
Previously, collections in TrustGraph were implicitly created during data loading operations, leading to synchronization issues where collections could exist in storage backends without corresponding metadata in the librarian. This created management challenges and potential orphaned data.
Current limitations include:
- No way to list existing collections
- No mechanism to delete unwanted collections
- No ability to associate metadata with collections for tracking purposes
- Difficulty in organizing and discovering collections over time
The explicit collection creation model addresses these issues by:
- Requiring collections to be created before use via `tg-set-collection`
- Broadcasting collection creation to all storage backends
- Maintaining synchronized state between librarian metadata and storage
- Preventing writes to non-existent collections
- Providing clear collection lifecycle management
This specification addresses these gaps by introducing explicit collection management operations. By providing collection management APIs and commands, TrustGraph can:
- Give users full control over their collection lifecycle
- Enable better organization through labels and tags
- Support collection cleanup for resource optimization
- Improve operational visibility and management
This specification defines the explicit collection management model. By requiring explicit collection creation, TrustGraph ensures:
- Collections are tracked in librarian metadata from creation
- All storage backends are aware of collections before receiving data
- No orphaned collections exist in storage
- Clear operational visibility and control over collection lifecycle
- Consistent error handling when operations reference non-existent collections
## Technical Design
@ -98,24 +102,52 @@ This approach allows:
#### Collection Lifecycle
Collections follow a lazy-creation pattern that aligns with existing TrustGraph behavior:
Collections are explicitly created in the librarian before data operations can proceed:
1. **Lazy Creation**: Collections are automatically created when first referenced during data loading or query operations. No explicit create operation is needed.
1. **Collection Creation** (Two Paths):
2. **Implicit Registration**: When a collection is used (data loading, querying), the system checks if a metadata record exists. If not, a new record is created with default values:
- `name`: defaults to collection_id
- `description`: empty
- `tags`: empty set
- `created_at`: current timestamp
**Path A: User-Initiated Creation** via `tg-set-collection`:
- User provides collection ID, name, description, and tags
- Librarian creates metadata record in `collections` table
- Librarian broadcasts "create-collection" to all storage backends
- All storage processors create collection and confirm success
- Collection is now ready for data operations
3. **Explicit Updates**: Users can update collection metadata (name, description, tags) through management operations after lazy creation.
**Path B: Automatic Creation on Document Submission**:
- User submits document specifying a collection ID
- Librarian checks if collection exists in metadata table
- If not exists: Librarian creates metadata with defaults (name=collection_id, empty description/tags)
- Librarian broadcasts "create-collection" to all storage backends
- All storage processors create collection and confirm success
- Document processing proceeds with collection now established
4. **Explicit Deletion**: Users can delete collections, which removes both the metadata record and the underlying collection data across all store types.
Both paths ensure collection exists in librarian metadata AND all storage backends before data operations.
5. **Multi-Store Deletion**: Collection deletion cascades across all storage backends (vector stores, object stores, triple stores) as each implements lazy creation and must support collection deletion.
2. **Storage Validation**: Write operations validate collection exists:
- Storage processors check collection state before accepting writes
- Writes to non-existent collections return error
- This prevents direct writes bypassing the librarian's collection creation logic
3. **Query Behavior**: Query operations handle non-existent collections gracefully:
- Queries to non-existent collections return empty results
- No error thrown for query operations
- Allows exploration without requiring collection to exist
4. **Metadata Updates**: Users can update collection metadata after creation:
- Update name, description, and tags via `tg-set-collection`
- Updates apply to librarian metadata only
- Storage backends maintain collection but metadata updates don't propagate
5. **Explicit Deletion**: Users delete collections via `tg-delete-collection`:
- Librarian broadcasts "delete-collection" to all storage backends
- Waits for confirmation from all storage processors
- Deletes librarian metadata record only after storage cleanup complete
- Ensures no orphaned data remains in storage
**Key Principle**: The librarian is the single point of control for collection creation. Whether initiated by user command or document submission, the librarian ensures proper metadata tracking and storage backend synchronization before allowing data operations.
Operations required:
- **Collection Use Notification**: Internal operation triggered during data loading/querying to ensure metadata record exists
- **Create Collection**: User operation via `tg-set-collection` OR automatic on document submission
- **Update Collection Metadata**: User operation to modify name, description, and tags
- **Delete Collection**: User operation to remove collection and its data across all stores
- **List Collections**: User operation to view collections with filtering by tags
@ -123,32 +155,65 @@ Operations required:
#### Multi-Store Collection Management
Collections exist across multiple storage backends in TrustGraph:
- **Vector Stores**: Store embeddings and vector data for collections
- **Object Stores**: Store documents and file data for collections
- **Triple Stores**: Store graph/RDF data for collections
- **Vector Stores** (Qdrant, Milvus, Pinecone): Store embeddings and vector data
- **Object Stores** (Cassandra): Store documents and file data
- **Triple Stores** (Cassandra, Neo4j, Memgraph, FalkorDB): Store graph/RDF data
Each store type implements:
- **Lazy Creation**: Collections are created implicitly when data is first stored
- **Collection Deletion**: Store-specific deletion operations to remove collection data
- **Collection State Tracking**: Maintain knowledge of which collections exist
- **Collection Creation**: Accept and process "create-collection" operations
- **Collection Validation**: Check collection exists before accepting writes
- **Collection Deletion**: Remove all data for specified collection
The librarian service coordinates collection operations across all store types, ensuring consistent collection lifecycle management.
The librarian service coordinates collection operations across all store types, ensuring:
- Collections created in all backends before use
- All backends confirm creation before returning success
- Synchronized collection lifecycle across storage types
- Consistent error handling when collections don't exist
#### Collection State Tracking by Storage Type
Each storage backend tracks collection state differently based on its capabilities:
**Cassandra Triple Store:**
- Uses existing `triples_collection` table
- Creates system marker triple when collection created
- Query: `SELECT collection FROM triples_collection WHERE collection = ? LIMIT 1`
- Efficient single-partition check for collection existence
**Qdrant/Milvus/Pinecone Vector Stores:**
- Native collection APIs provide existence checking
- Collections created with proper vector configuration
- `collection_exists()` method uses storage API
- Collection creation validates dimension requirements
**Neo4j/Memgraph/FalkorDB Graph Stores:**
- Use `:CollectionMetadata` nodes to track collections
- Node properties: `{user, collection, created_at}`
- Query: `MATCH (c:CollectionMetadata {user: $user, collection: $collection})`
- Separate from data nodes for clean separation
- Enables efficient collection listing and validation
**Cassandra Object Store:**
- Uses collection metadata table or marker rows
- Similar pattern to triple store
- Validates collection before document writes
### APIs
New APIs:
Collection Management APIs (Librarian):
- **Create/Update Collection**: Create new collection or update existing metadata via `tg-set-collection`
- **List Collections**: Retrieve collections for a user with optional tag filtering
- **Update Collection Metadata**: Modify collection name, description, and tags
- **Delete Collection**: Remove collection and associated data with confirmation, cascading to all store types
- **Collection Use Notification** (Internal): Ensure metadata record exists when collection is referenced
- **Delete Collection**: Remove collection and associated data, cascading to all store types
Store Writer APIs (Enhanced):
- **Vector Store Collection Deletion**: Remove vector data for specified user and collection
- **Object Store Collection Deletion**: Remove object/document data for specified user and collection
- **Triple Store Collection Deletion**: Remove graph/RDF data for specified user and collection
Storage Management APIs (All Storage Processors):
- **Create Collection**: Handle "create-collection" operation, establish collection in storage
- **Delete Collection**: Handle "delete-collection" operation, remove all collection data
- **Collection Exists Check**: Internal validation before accepting write operations
Modified APIs:
- **Data Loading APIs**: Enhanced to trigger collection use notification for lazy metadata creation
- **Query APIs**: Enhanced to trigger collection use notification and optionally include metadata in responses
Data Operation APIs (Modified Behavior):
- **Write APIs**: Validate collection exists before accepting data, return error if not
- **Query APIs**: Return empty results for non-existent collections without error
### Implementation Details
@ -168,32 +233,35 @@ When a user initiates collection deletion through the librarian service:
#### Collection Management Interface
All store writers will implement a standardized collection management interface with a common schema across store types:
All store writers implement a standardized collection management interface with a common schema:
**Message Schema:**
**Message Schema (`StorageManagementRequest`):**
```json
{
"operation": "delete-collection",
"operation": "create-collection" | "delete-collection",
"user": "user123",
"collection": "documents-2024",
"timestamp": "2024-01-15T10:30:00Z"
"collection": "documents-2024"
}
```
**Queue Architecture:**
- **Object Store Collection Management Queue**: Handles collection operations for object/document stores
- **Vector Store Collection Management Queue**: Handles collection operations for vector/embedding stores
- **Triple Store Collection Management Queue**: Handles collection operations for graph/RDF stores
- **Vector Store Management Queue** (`vector-storage-management`): Vector/embedding stores
- **Object Store Management Queue** (`object-storage-management`): Object/document stores
- **Triple Store Management Queue** (`triples-storage-management`): Graph/RDF stores
- **Storage Response Queue** (`storage-management-response`): All responses sent here
Each store writer implements:
- **Collection Management Handler**: Separate from standard data storage handlers
- **Delete Collection Operation**: Removes all data associated with the specified collection
- **Message Processing**: Consumes from dedicated collection management queue
- **Status Reporting**: Returns success/failure status for coordination
- **Idempotent Operations**: Handles cases where collection doesn't exist (no-op)
- **Collection Management Handler**: Processes `StorageManagementRequest` messages
- **Create Collection Operation**: Establishes collection in storage backend
- **Delete Collection Operation**: Removes all data associated with collection
- **Collection State Tracking**: Maintains knowledge of which collections exist
- **Message Processing**: Consumes from dedicated management queue
- **Status Reporting**: Returns success/failure via `StorageManagementResponse`
- **Idempotent Operations**: Safe to call create/delete multiple times
**Initial Implementation:**
Only `delete-collection` operation will be implemented initially. The interface supports future operations like `archive-collection`, `migrate-collection`, etc.
**Supported Operations:**
- `create-collection`: Create collection in storage backend
- `delete-collection`: Remove all collection data from storage backend
#### Cassandra Triple Store Refactor
@ -244,13 +312,11 @@ As part of this implementation, the Cassandra triple store will be refactored fr
- Maintain same query logic with collection parameter
**Benefits:**
- **Simplified Collection Deletion**: Simple `DELETE FROM triples WHERE collection = ?` instead of dropping tables
- **Simplified Collection Deletion**: Delete using `collection` partition key across all 4 tables
- **Resource Efficiency**: Fewer database connections and table objects
- **Cross-Collection Operations**: Easier to implement operations spanning multiple collections
- **Consistent Architecture**: Aligns with unified collection metadata approach
**Migration Strategy:**
Existing table-per-collection data will need migration to the new unified schema during the upgrade process.
- **Collection Validation**: Easy to check collection existence via `triples_collection` table
Collection operations will be atomic where possible and provide appropriate error handling and validation.
@ -264,37 +330,25 @@ Collection listing operations may need pagination for environments with large nu
## Testing Strategy
Comprehensive testing will cover collection lifecycle operations, metadata management, and CLI command functionality with both unit and integration tests.
## Migration Plan
This implementation requires both metadata and storage migrations:
### Collection Metadata Migration
Existing collections will need to be registered in the new Cassandra collections metadata table. A migration process will:
- Scan existing keyspaces and tables to identify collections
- Create metadata records with default values (name=collection_id, empty description/tags)
- Preserve creation timestamps where possible
### Cassandra Triple Store Migration
The Cassandra storage refactor requires data migration from table-per-collection to unified table:
- **Pre-migration**: Identify all user keyspaces and collection tables
- **Data Transfer**: Copy triples from individual collection tables to unified "triples" table with collection
- **Schema Validation**: Ensure new primary key structure maintains query performance
- **Cleanup**: Remove old collection tables after successful migration
- **Rollback Plan**: Maintain ability to restore table-per-collection structure if needed
Migration will be performed during a maintenance window to ensure data consistency.
Comprehensive testing will cover:
- Collection creation workflow end-to-end
- Storage backend synchronization
- Write validation for non-existent collections
- Query handling of non-existent collections
- Collection deletion cascade across all stores
- Error handling and recovery scenarios
- Unit tests for each storage backend
- Integration tests for cross-store operations
## Implementation Status
### ✅ Completed Components
1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_service.py`)
- Complete collection CRUD operations (list, update, delete)
1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_manager.py`)
- Collection metadata CRUD operations (list, update, delete)
- Cassandra collection metadata table integration via `LibraryTableStore`
- Async request/response handling with proper error management
- Collection deletion cascade coordination across all storage types
- Async request/response handling with proper error management
2. **Collection Metadata Schema** (`trustgraph-base/trustgraph/schema/services/collection.py`)
- `CollectionManagementRequest` and `CollectionManagementResponse` schemas
@ -303,47 +357,70 @@ Migration will be performed during a maintenance window to ensure data consisten
3. **Storage Management Schema** (`trustgraph-base/trustgraph/schema/services/storage.py`)
- `StorageManagementRequest` and `StorageManagementResponse` schemas
- Storage management queue topics defined
- Message format for storage-level collection operations
### ❌ Missing Components
4. **Cassandra 4-Table Schema** (`trustgraph-flow/trustgraph/direct/cassandra_kg.py`)
- Compound partition keys for query performance
- `triples_collection` table for SPO queries and deletion tracking
- Collection deletion implemented with read-then-delete pattern
1. **Storage Management Queue Topics**
- Missing topic definitions in schema for:
- `vector_storage_management_topic`
- `object_storage_management_topic`
- `triples_storage_management_topic`
- `storage_management_response_topic`
- These are referenced by the librarian service but not yet defined
### 🔄 In Progress Components
2. **Store Collection Management Handlers**
- **Vector Store Writers** (Qdrant, Milvus, Pinecone): No collection deletion handlers
- **Object Store Writers** (Cassandra): No collection deletion handlers
- **Triple Store Writers** (Cassandra, Neo4j, Memgraph, FalkorDB): No collection deletion handlers
- Need to implement `StorageManagementRequest` processing in each store writer
1. **Collection Creation Broadcast** (`trustgraph-flow/trustgraph/librarian/collection_manager.py`)
- Update `update_collection()` to send "create-collection" to storage backends
- Wait for confirmations from all storage processors
- Handle creation failures appropriately
3. **Collection Management Interface Implementation**
- Store writers need collection management message consumers
- Collection deletion operations need to be implemented per store type
- Response handling back to librarian service
2. **Document Submission Handler** (`trustgraph-flow/trustgraph/librarian/service.py` or similar)
- Check if collection exists when document submitted
- If not exists: Create collection with defaults before processing document
- Trigger same "create-collection" broadcast as `tg-set-collection`
- Ensure collection established before document flows to storage processors
### ❌ Pending Components
1. **Collection State Tracking** - Need to implement in each storage backend:
- **Cassandra Triples**: Use `triples_collection` table with marker triples
- **Neo4j/Memgraph/FalkorDB**: Create `:CollectionMetadata` nodes
- **Qdrant/Milvus/Pinecone**: Use native collection APIs
- **Cassandra Objects**: Add collection metadata tracking
2. **Storage Management Handlers** - Need "create-collection" support in 12 files:
- `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
- `trustgraph-flow/trustgraph/storage/triples/neo4j/write.py`
- `trustgraph-flow/trustgraph/storage/triples/memgraph/write.py`
- `trustgraph-flow/trustgraph/storage/triples/falkordb/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/qdrant/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/milvus/write.py`
- `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`
- `trustgraph-flow/trustgraph/storage/graph_embeddings/pinecone/write.py`
- `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py`
- Plus any other storage implementations
3. **Write Operation Validation** - Add collection existence checks to all `store_*` methods
4. **Query Operation Handling** - Update queries to return empty for non-existent collections
### Next Implementation Steps
1. **Define Storage Management Topics** in `trustgraph-base/trustgraph/schema/services/storage.py`
2. **Implement Collection Management Handlers** in each storage writer:
- Add `StorageManagementRequest` consumers
- Implement collection deletion operations
- Add response producers for status reporting
3. **Test End-to-End Collection Deletion** across all storage types
**Phase 1: Core Infrastructure (2-3 days)**
1. Add collection state tracking methods to all storage backends
2. Implement `collection_exists()` and `create_collection()` methods
## Timeline
**Phase 2: Storage Handlers (1 week)**
3. Add "create-collection" handlers to all storage processors
4. Add write validation to reject non-existent collections
5. Update query handling for non-existent collections
Phase 1 (Storage Topics): 1-2 days
Phase 2 (Store Handlers): 1-2 weeks depending on number of storage backends
Phase 3 (Testing & Integration): 3-5 days
**Phase 3: Collection Manager (2-3 days)**
6. Update collection_manager to broadcast creates
7. Implement response tracking and error handling
## Open Questions
- Should collection deletion be soft or hard delete by default?
- What metadata fields should be required vs optional?
- Should we implement storage management handlers incrementally by store type?
**Phase 4: Testing (3-5 days)**
8. End-to-end testing of explicit creation workflow
9. Test all storage backends
10. Validate error handling and edge cases