mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Collection delete pt. 3 (#542)
* Fixing collection deletion
* Fixing collection management param error
* Always test for collections
* Add Cassandra collection table
* Updated tech spec for explicit creation/deletion
* Remove implicit collection creation
* Fix up collection tracking in all processors
parent dc79b10552
commit 52b133fc86
31 changed files with 1761 additions and 843 deletions
@@ -158,17 +158,17 @@ The current primary key `PRIMARY KEY (collection, s, p, o)` provides minimal clu
 - Uneven load distribution across cluster nodes
 - Scalability bottlenecks as collections grow

-## Proposed Solution: Multi-Table Denormalization Strategy
+## Proposed Solution: 4-Table Denormalization Strategy

 ### Overview

-Replace the single `triples` table with three purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and ALLOW FILTERING while providing optimal performance for all query types.
+Replace the single `triples` table with four purpose-built tables, each optimized for specific query patterns. This eliminates the need for secondary indexes and ALLOW FILTERING while providing optimal performance for all query types. The fourth table enables efficient collection deletion despite compound partition keys.

 ### New Schema Design

-**Table 1: Subject-Centric Queries**
+**Table 1: Subject-Centric Queries (triples_s)**
 ```sql
-CREATE TABLE triples_by_subject (
+CREATE TABLE triples_s (
     collection text,
     s text,
     p text,
@@ -176,13 +176,13 @@ CREATE TABLE triples_by_subject (
     PRIMARY KEY ((collection, s), p, o)
 );
 ```
-- **Optimizes:** get_s, get_sp, get_spo, get_os
+- **Optimizes:** get_s, get_sp, get_os
 - **Partition Key:** (collection, s) - Better distribution than collection alone
 - **Clustering:** (p, o) - Enables efficient predicate/object lookups for a subject

-**Table 2: Predicate-Object Queries**
+**Table 2: Predicate-Object Queries (triples_p)**
 ```sql
-CREATE TABLE triples_by_po (
+CREATE TABLE triples_p (
     collection text,
     p text,
     o text,
@@ -194,9 +194,9 @@ CREATE TABLE triples_by_po (
 - **Partition Key:** (collection, p) - Direct access by predicate
 - **Clustering:** (o, s) - Efficient object-subject traversal

-**Table 3: Object-Centric Queries**
+**Table 3: Object-Centric Queries (triples_o)**
 ```sql
-CREATE TABLE triples_by_object (
+CREATE TABLE triples_o (
     collection text,
     o text,
     s text,
@@ -204,30 +204,72 @@ CREATE TABLE triples_by_object (
     PRIMARY KEY ((collection, o), s, p)
 );
 ```
-- **Optimizes:** get_o, get_os
+- **Optimizes:** get_o
 - **Partition Key:** (collection, o) - Direct access by object
 - **Clustering:** (s, p) - Efficient subject-predicate traversal

+**Table 4: Collection Management & SPO Queries (triples_collection)**
+```sql
+CREATE TABLE triples_collection (
+    collection text,
+    s text,
+    p text,
+    o text,
+    PRIMARY KEY (collection, s, p, o)
+);
+```
+- **Optimizes:** get_spo, delete_collection
+- **Partition Key:** collection only - Enables efficient collection-level operations
+- **Clustering:** (s, p, o) - Standard triple ordering
+- **Purpose:** Dual use for exact SPO lookups and as deletion index

 ### Query Mapping

 | Original Query | Target Table | Performance Improvement |
 |----------------|-------------|------------------------|
-| get_all(collection) | triples_by_subject | Token-based pagination |
-| get_s(collection, s) | triples_by_subject | Direct partition access |
-| get_p(collection, p) | triples_by_po | Direct partition access |
-| get_o(collection, o) | triples_by_object | Direct partition access |
-| get_sp(collection, s, p) | triples_by_subject | Partition + clustering |
-| get_po(collection, p, o) | triples_by_po | **No more ALLOW FILTERING!** |
-| get_os(collection, o, s) | triples_by_subject | Partition + clustering |
-| get_spo(collection, s, p, o) | triples_by_subject | Exact key lookup |
+| get_all(collection) | triples_s | ALLOW FILTERING (acceptable for scan) |
+| get_s(collection, s) | triples_s | Direct partition access |
+| get_p(collection, p) | triples_p | Direct partition access |
+| get_o(collection, o) | triples_o | Direct partition access |
+| get_sp(collection, s, p) | triples_s | Partition + clustering |
+| get_po(collection, p, o) | triples_p | **No more ALLOW FILTERING!** |
+| get_os(collection, o, s) | triples_o | Partition + clustering |
+| get_spo(collection, s, p, o) | triples_collection | Exact key lookup |
+| delete_collection(collection) | triples_collection | Read index, batch delete all |

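The new mapping above amounts to routing each access pattern (which of s, p, o are bound) to a target table. A minimal, hypothetical Python sketch of that routing — table names come from the spec, but the `route` helper itself is illustrative, not TrustGraph code:

```python
# Which table serves each access pattern (per the Query Mapping table).
QUERY_ROUTING = {
    "get_all": "triples_s",
    "get_s":   "triples_s",
    "get_p":   "triples_p",
    "get_o":   "triples_o",
    "get_sp":  "triples_s",
    "get_po":  "triples_p",
    "get_os":  "triples_o",
    "get_spo": "triples_collection",
}

def route(s=None, p=None, o=None):
    """Pick the query name and target table from which terms are bound."""
    name = "get_" + "".join(k for k, v in (("s", s), ("p", p), ("o", o)) if v is not None)
    if name == "get_":
        name = "get_all"       # nothing bound: full collection scan
    if name == "get_so":
        name = "get_os"        # the spec routes the (o, s) pattern via triples_o
    return name, QUERY_ROUTING[name]
```

For example, `route(p="rdf:type", o="Person")` selects `triples_p`, which is exactly the case that previously needed ALLOW FILTERING.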
+### Collection Deletion Strategy
+
+With compound partition keys, we cannot simply execute `DELETE FROM table WHERE collection = ?`. Instead:
+
+1. **Read Phase:** Query `triples_collection` to enumerate all triples:
+```sql
+SELECT s, p, o FROM triples_collection WHERE collection = ?
+```
+This is efficient since `collection` is the partition key for this table.
+
+2. **Delete Phase:** For each triple (s, p, o), delete from all 4 tables using full partition keys:
+```sql
+DELETE FROM triples_s WHERE collection = ? AND s = ? AND p = ? AND o = ?
+DELETE FROM triples_p WHERE collection = ? AND p = ? AND o = ? AND s = ?
+DELETE FROM triples_o WHERE collection = ? AND o = ? AND s = ? AND p = ?
+DELETE FROM triples_collection WHERE collection = ? AND s = ? AND p = ? AND o = ?
+```
+Batched in groups of 100 for efficiency.
+
+**Trade-off Analysis:**
+- ✅ Maintains optimal query performance with distributed partitions
+- ✅ No hot partitions for large collections
+- ❌ More complex deletion logic (read-then-delete)
+- ❌ Deletion time proportional to collection size

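The read-then-delete strategy above can be sketched storage-agnostically: chunk the enumerated triples into groups of 100, and fan each triple out to deletes against all four tables. This is a minimal illustration (the `execute_batch` callback stands in for `session.execute(BatchStatement)`; function names are hypothetical):

```python
from itertools import islice

BATCH_SIZE = 100  # matches the spec's batching of 100 triples

def batched(iterable, size=BATCH_SIZE):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def delete_collection(triples, execute_batch):
    """Read-then-delete: `triples` is the enumeration from triples_collection."""
    count = 0
    for chunk in batched(triples):
        # Each triple fans out to a delete against all four tables.
        statements = [
            (table, triple)
            for triple in chunk
            for table in ("triples_s", "triples_p", "triples_o", "triples_collection")
        ]
        execute_batch(statements)
        count += len(chunk)
    return count
```

So a 250-triple collection produces three batches (100, 100, 50 triples), each batch carrying four delete statements per triple.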
 ### Benefits

-1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path
+1. **Eliminates ALLOW FILTERING** - Every query has an optimal access path (except get_all scan)
 2. **No Secondary Indexes** - Each table IS the index for its query pattern
 3. **Better Data Distribution** - Composite partition keys spread load effectively
 4. **Predictable Performance** - Query time proportional to result size, not total data
 5. **Leverages Cassandra Strengths** - Designed for Cassandra's architecture
+6. **Enables Collection Deletion** - triples_collection serves as deletion index

 ## Implementation Plan

@@ -295,10 +337,11 @@ def delete_collection(self, collection) -> None # Delete from all three tables
 ### Implementation Strategy

 #### Phase 1: Schema and Core Methods
-1. **Rewrite `init()` method** - Create three tables instead of one
-2. **Rewrite `insert()` method** - Batch writes to all three tables
+1. **Rewrite `init()` method** - Create four tables instead of one
+2. **Rewrite `insert()` method** - Batch writes to all four tables
 3. **Implement prepared statements** - For optimal performance
 4. **Add table routing logic** - Direct queries to optimal tables
+5. **Implement collection deletion** - Read from triples_collection, batch delete from all tables

 #### Phase 2: Query Method Optimization
 1. **Rewrite each get_* method** to use optimal table
@@ -318,18 +361,11 @@ def delete_collection(self, collection) -> None # Delete from all three tables
 def insert(self, collection, s, p, o):
     batch = BatchStatement()

-    # Insert into all three tables
-    batch.add(SimpleStatement(
-        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
-    ), (collection, s, p, o))
-
-    batch.add(SimpleStatement(
-        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
-    ), (collection, p, o, s))
-
-    batch.add(SimpleStatement(
-        "INSERT INTO triples_by_object (collection, o, s, p) VALUES (?, ?, ?, ?)"
-    ), (collection, o, s, p))
+    # Insert into all four tables
+    batch.add(self.insert_subject_stmt, (collection, s, p, o))
+    batch.add(self.insert_po_stmt, (collection, p, o, s))
+    batch.add(self.insert_object_stmt, (collection, o, s, p))
+    batch.add(self.insert_collection_stmt, (collection, s, p, o))

     self.session.execute(batch)
 ```
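Because each table stores the same triple in a different column order, the insert fan-out is essentially a per-table permutation of `(collection, s, p, o)`. A small, illustrative sketch of just that permutation (table names from the schema above; the helper itself is hypothetical):

```python
# Bound-value order per table, as defined in the New Schema Design section.
TABLE_COLUMN_ORDER = {
    "triples_s":          lambda c, s, p, o: (c, s, p, o),
    "triples_p":          lambda c, s, p, o: (c, p, o, s),
    "triples_o":          lambda c, s, p, o: (c, o, s, p),
    "triples_collection": lambda c, s, p, o: (c, s, p, o),
}

def insert_rows(collection, s, p, o):
    """Return the (table, bound-values) pairs one logical insert fans out to."""
    return [(table, order(collection, s, p, o))
            for table, order in TABLE_COLUMN_ORDER.items()]
```

Getting these orderings wrong silently corrupts lookups, which is one reason the spec moves the real statements into cached prepared statements.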
@@ -337,11 +373,65 @@ def insert(self, collection, s, p, o):
 #### Query Routing Logic
 ```python
 def get_po(self, collection, p, o, limit=10):
-    # Route to triples_by_po table - NO ALLOW FILTERING!
+    # Route to triples_p table - NO ALLOW FILTERING!
     return self.session.execute(
-        "SELECT s FROM triples_by_po WHERE collection = ? AND p = ? AND o = ? LIMIT ?",
+        self.get_po_stmt,
         (collection, p, o, limit)
     )

+def get_spo(self, collection, s, p, o, limit=10):
+    # Route to triples_collection table for exact SPO lookup
+    return self.session.execute(
+        self.get_spo_stmt,
+        (collection, s, p, o, limit)
+    )
 ```

+#### Collection Deletion Logic
+```python
+def delete_collection(self, collection):
+    # Step 1: Read all triples from collection table
+    rows = self.session.execute(
+        f"SELECT s, p, o FROM {self.collection_table} WHERE collection = %s",
+        (collection,)
+    )
+
+    # Step 2: Batch delete from all 4 tables
+    batch = BatchStatement()
+    count = 0
+
+    for row in rows:
+        s, p, o = row.s, row.p, row.o
+
+        # Delete using full partition keys for each table
+        # (simple statements use %s placeholders; ? is for prepared statements)
+        batch.add(SimpleStatement(
+            f"DELETE FROM {self.subject_table} WHERE collection = %s AND s = %s AND p = %s AND o = %s"
+        ), (collection, s, p, o))
+
+        batch.add(SimpleStatement(
+            f"DELETE FROM {self.po_table} WHERE collection = %s AND p = %s AND o = %s AND s = %s"
+        ), (collection, p, o, s))
+
+        batch.add(SimpleStatement(
+            f"DELETE FROM {self.object_table} WHERE collection = %s AND o = %s AND s = %s AND p = %s"
+        ), (collection, o, s, p))
+
+        batch.add(SimpleStatement(
+            f"DELETE FROM {self.collection_table} WHERE collection = %s AND s = %s AND p = %s AND o = %s"
+        ), (collection, s, p, o))
+
+        count += 1
+
+        # Execute every 100 triples to avoid oversized batches
+        if count % 100 == 0:
+            self.session.execute(batch)
+            batch = BatchStatement()
+
+    # Execute remaining deletions
+    if count % 100 != 0:
+        self.session.execute(batch)
+
+    logger.info(f"Deleted {count} triples from collection {collection}")
+```

 #### Prepared Statement Optimization
@@ -349,12 +439,18 @@ def get_po(self, collection, p, o, limit=10):
 def prepare_statements(self):
     # Cache prepared statements for better performance
     self.insert_subject_stmt = self.session.prepare(
-        "INSERT INTO triples_by_subject (collection, s, p, o) VALUES (?, ?, ?, ?)"
+        f"INSERT INTO {self.subject_table} (collection, s, p, o) VALUES (?, ?, ?, ?)"
     )
     self.insert_po_stmt = self.session.prepare(
-        "INSERT INTO triples_by_po (collection, p, o, s) VALUES (?, ?, ?, ?)"
+        f"INSERT INTO {self.po_table} (collection, p, o, s) VALUES (?, ?, ?, ?)"
     )
-    # ... etc for all tables and queries
+    self.insert_object_stmt = self.session.prepare(
+        f"INSERT INTO {self.object_table} (collection, o, s, p) VALUES (?, ?, ?, ?)"
+    )
+    self.insert_collection_stmt = self.session.prepare(
+        f"INSERT INTO {self.collection_table} (collection, s, p, o) VALUES (?, ?, ?, ?)"
+    )
+    # ... query statements
 ```

 ## Migration Strategy
@@ -511,9 +607,10 @@ def rollback_to_legacy():
 ## Risks and Considerations

 ### Performance Risks
-- **Write latency increase** - 3x write operations per insert
-- **Storage overhead** - 3x storage requirement
+- **Write latency increase** - 4x write operations per insert (33% more than 3-table approach)
+- **Storage overhead** - 4x storage requirement (33% more than 3-table approach)
 - **Batch write failures** - Need proper error handling
+- **Deletion complexity** - Collection deletion requires read-then-delete loop

 ### Operational Risks
 - **Migration complexity** - Data migration for large datasets
@@ -2,16 +2,17 @@
 ## Overview

-This specification describes the collection management capabilities for TrustGraph, enabling users to have explicit control over collections that are currently implicitly created during data loading and querying operations. The feature supports four primary use cases:
+This specification describes the collection management capabilities for TrustGraph, requiring explicit collection creation and providing direct control over the collection lifecycle. Collections must be explicitly created before use, ensuring proper synchronization between the librarian metadata and all storage backends. The feature supports four primary use cases:

-1. **Collection Listing**: View all existing collections in the system
-2. **Collection Deletion**: Remove unwanted collections and their associated data
-3. **Collection Labeling**: Associate descriptive labels with collections for better organization
-4. **Collection Tagging**: Apply tags to collections for categorization and easier discovery
+1. **Collection Creation**: Explicitly create collections before storing data
+2. **Collection Listing**: View all existing collections in the system
+3. **Collection Metadata Management**: Update collection names, descriptions, and tags
+4. **Collection Deletion**: Remove collections and their associated data across all storage types

 ## Goals

-- **Explicit Collection Control**: Provide users with direct management capabilities over collections beyond implicit creation
+- **Explicit Collection Creation**: Require collections to be created before data can be stored
+- **Storage Synchronization**: Ensure collections exist in all storage backends (vectors, objects, triples)
 - **Collection Visibility**: Enable users to list and inspect all collections in their environment
 - **Collection Cleanup**: Allow deletion of collections that are no longer needed
 - **Collection Organization**: Support labels and tags for better collection tracking and discovery
@@ -19,22 +20,25 @@ This specification describes the collection management capabilities for TrustGra
 - **Collection Discovery**: Make it easier to find specific collections through filtering and search
 - **Operational Transparency**: Provide clear visibility into collection lifecycle and usage
 - **Resource Management**: Enable cleanup of unused collections to optimize resource utilization
+- **Data Integrity**: Prevent orphaned collections in storage without metadata tracking

 ## Background

-Currently, collections in TrustGraph are implicitly created during data loading operations and query execution. While this provides convenience for users, it lacks the explicit control needed for production environments and long-term data management.
+Previously, collections in TrustGraph were implicitly created during data loading operations, leading to synchronization issues where collections could exist in storage backends without corresponding metadata in the librarian. This created management challenges and potential orphaned data.

-Current limitations include:
-- No way to list existing collections
-- No mechanism to delete unwanted collections
-- No ability to associate metadata with collections for tracking purposes
-- Difficulty in organizing and discovering collections over time
+The explicit collection creation model addresses these issues by:
+- Requiring collections to be created before use via `tg-set-collection`
+- Broadcasting collection creation to all storage backends
+- Maintaining synchronized state between librarian metadata and storage
+- Preventing writes to non-existent collections
+- Providing clear collection lifecycle management

-This specification addresses these gaps by introducing explicit collection management operations. By providing collection management APIs and commands, TrustGraph can:
-- Give users full control over their collection lifecycle
-- Enable better organization through labels and tags
-- Support collection cleanup for resource optimization
-- Improve operational visibility and management
+This specification defines the explicit collection management model. By requiring explicit collection creation, TrustGraph ensures:
+- Collections are tracked in librarian metadata from creation
+- All storage backends are aware of collections before receiving data
+- No orphaned collections exist in storage
+- Clear operational visibility and control over collection lifecycle
+- Consistent error handling when operations reference non-existent collections

 ## Technical Design
@@ -98,24 +102,52 @@ This approach allows:
 #### Collection Lifecycle

-Collections follow a lazy-creation pattern that aligns with existing TrustGraph behavior:
+Collections are explicitly created in the librarian before data operations can proceed:

-1. **Lazy Creation**: Collections are automatically created when first referenced during data loading or query operations. No explicit create operation is needed.
+1. **Collection Creation** (Two Paths):
+
+   **Path A: User-Initiated Creation** via `tg-set-collection`:
+   - User provides collection ID, name, description, and tags
+   - Librarian creates metadata record in `collections` table
+   - Librarian broadcasts "create-collection" to all storage backends
+   - All storage processors create collection and confirm success
+   - Collection is now ready for data operations

-2. **Implicit Registration**: When a collection is used (data loading, querying), the system checks if a metadata record exists. If not, a new record is created with default values:
-   - `name`: defaults to collection_id
-   - `description`: empty
-   - `tags`: empty set
-   - `created_at`: current timestamp
+   **Path B: Automatic Creation on Document Submission**:
+   - User submits document specifying a collection ID
+   - Librarian checks if collection exists in metadata table
+   - If not exists: Librarian creates metadata with defaults (name=collection_id, empty description/tags)
+   - Librarian broadcasts "create-collection" to all storage backends
+   - All storage processors create collection and confirm success
+   - Document processing proceeds with collection now established

-3. **Explicit Updates**: Users can update collection metadata (name, description, tags) through management operations after lazy creation.
+   Both paths ensure collection exists in librarian metadata AND all storage backends before data operations.

-4. **Explicit Deletion**: Users can delete collections, which removes both the metadata record and the underlying collection data across all store types.
+2. **Storage Validation**: Write operations validate collection exists:
+   - Storage processors check collection state before accepting writes
+   - Writes to non-existent collections return error
+   - This prevents direct writes bypassing the librarian's collection creation logic

-5. **Multi-Store Deletion**: Collection deletion cascades across all storage backends (vector stores, object stores, triple stores) as each implements lazy creation and must support collection deletion.
+3. **Query Behavior**: Query operations handle non-existent collections gracefully:
+   - Queries to non-existent collections return empty results
+   - No error thrown for query operations
+   - Allows exploration without requiring collection to exist
+
+4. **Metadata Updates**: Users can update collection metadata after creation:
+   - Update name, description, and tags via `tg-set-collection`
+   - Updates apply to librarian metadata only
+   - Storage backends maintain collection but metadata updates don't propagate
+
+5. **Explicit Deletion**: Users delete collections via `tg-delete-collection`:
+   - Librarian broadcasts "delete-collection" to all storage backends
+   - Waits for confirmation from all storage processors
+   - Deletes librarian metadata record only after storage cleanup complete
+   - Ensures no orphaned data remains in storage
+
+**Key Principle**: The librarian is the single point of control for collection creation. Whether initiated by user command or document submission, the librarian ensures proper metadata tracking and storage backend synchronization before allowing data operations.

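The create/delete coordination described above can be illustrated with a minimal in-memory sketch: stubs stand in for the metadata table and the storage processors, and all class and method names here are hypothetical, not the actual TrustGraph implementation.

```python
class StorageBackend:
    """Stub for a vector/object/triple storage processor."""
    def __init__(self):
        self.collections = set()

    def create_collection(self, user, collection):
        self.collections.add((user, collection))   # idempotent
        return True                                 # confirmation to the librarian

    def delete_collection(self, user, collection):
        self.collections.discard((user, collection))
        return True


class Librarian:
    """Single point of control for the collection lifecycle."""
    def __init__(self, backends):
        self.backends = backends
        self.metadata = {}  # (user, collection) -> metadata record

    def create_collection(self, user, collection, name=None, description="", tags=()):
        # Record metadata first (Path A or Path B), then broadcast create-collection.
        self.metadata[(user, collection)] = {
            "name": name or collection,  # Path B defaults: name=collection_id
            "description": description,
            "tags": set(tags),
        }
        assert all(b.create_collection(user, collection) for b in self.backends)

    def delete_collection(self, user, collection):
        # Storage cleanup first; metadata removed only after all backends confirm.
        assert all(b.delete_collection(user, collection) for b in self.backends)
        del self.metadata[(user, collection)]
```

The ordering is the point of the sketch: creation writes metadata before broadcasting, while deletion removes metadata only after every backend confirms, so a partial failure never leaves orphaned storage data without a metadata record.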
 Operations required:
-- **Collection Use Notification**: Internal operation triggered during data loading/querying to ensure metadata record exists
+- **Create Collection**: User operation via `tg-set-collection` OR automatic on document submission
+- **Update Collection Metadata**: User operation to modify name, description, and tags
 - **Delete Collection**: User operation to remove collection and its data across all stores
 - **List Collections**: User operation to view collections with filtering by tags
@@ -123,32 +155,65 @@ Operations required:
 #### Multi-Store Collection Management

 Collections exist across multiple storage backends in TrustGraph:
-- **Vector Stores**: Store embeddings and vector data for collections
-- **Object Stores**: Store documents and file data for collections
-- **Triple Stores**: Store graph/RDF data for collections
+- **Vector Stores** (Qdrant, Milvus, Pinecone): Store embeddings and vector data
+- **Object Stores** (Cassandra): Store documents and file data
+- **Triple Stores** (Cassandra, Neo4j, Memgraph, FalkorDB): Store graph/RDF data

 Each store type implements:
-- **Lazy Creation**: Collections are created implicitly when data is first stored
-- **Collection Deletion**: Store-specific deletion operations to remove collection data
-- **Collection State Tracking**: Maintain knowledge of which collections exist
+- **Collection Creation**: Accept and process "create-collection" operations
+- **Collection Validation**: Check collection exists before accepting writes
+- **Collection Deletion**: Remove all data for specified collection

-The librarian service coordinates collection operations across all store types, ensuring consistent collection lifecycle management.
+The librarian service coordinates collection operations across all store types, ensuring:
+- Collections created in all backends before use
+- All backends confirm creation before returning success
+- Synchronized collection lifecycle across storage types
+- Consistent error handling when collections don't exist

+#### Collection State Tracking by Storage Type
+
+Each storage backend tracks collection state differently based on its capabilities:
+
+**Cassandra Triple Store:**
+- Uses existing `triples_collection` table
+- Creates system marker triple when collection created
+- Query: `SELECT collection FROM triples_collection WHERE collection = ? LIMIT 1`
+- Efficient single-partition check for collection existence
+
+**Qdrant/Milvus/Pinecone Vector Stores:**
+- Native collection APIs provide existence checking
+- Collections created with proper vector configuration
+- `collection_exists()` method uses storage API
+- Collection creation validates dimension requirements
+
+**Neo4j/Memgraph/FalkorDB Graph Stores:**
+- Use `:CollectionMetadata` nodes to track collections
+- Node properties: `{user, collection, created_at}`
+- Query: `MATCH (c:CollectionMetadata {user: $user, collection: $collection})`
+- Separate from data nodes for clean separation
+- Enables efficient collection listing and validation
+
+**Cassandra Object Store:**
+- Uses collection metadata table or marker rows
+- Similar pattern to triple store
+- Validates collection before document writes

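The per-backend checks above reduce to one dispatch decision: vector stores ask their native API, Cassandra stores probe a marker row, graph stores match a `:CollectionMetadata` node. A minimal sketch of that dispatch, using the query shapes quoted in the spec — the `client` objects and the dispatcher itself are illustrative stand-ins, not real TrustGraph or driver APIs:

```python
def collection_exists(kind, client, user, collection):
    """Return True if `collection` exists in the given backend kind."""
    if kind == "vector":
        # Qdrant/Milvus/Pinecone expose a native existence API (stubbed here).
        return client.collection_exists(collection)
    if kind == "cassandra":
        # Single-partition probe against the triples_collection marker table.
        rows = client.execute(
            "SELECT collection FROM triples_collection WHERE collection = %s LIMIT 1",
            (collection,))
        return len(rows) > 0
    if kind == "graph":
        # Neo4j / Memgraph / FalkorDB: match the CollectionMetadata node.
        rows = client.run(
            "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) RETURN c",
            user=user, collection=collection)
        return len(rows) > 0
    raise ValueError(f"unknown backend kind: {kind}")
```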
 ### APIs

-New APIs:
+Collection Management APIs (Librarian):
+- **Create/Update Collection**: Create new collection or update existing metadata via `tg-set-collection`
 - **List Collections**: Retrieve collections for a user with optional tag filtering
-- **Update Collection Metadata**: Modify collection name, description, and tags
-- **Delete Collection**: Remove collection and associated data with confirmation, cascading to all store types
-- **Collection Use Notification** (Internal): Ensure metadata record exists when collection is referenced
+- **Delete Collection**: Remove collection and associated data, cascading to all store types

-Store Writer APIs (Enhanced):
-- **Vector Store Collection Deletion**: Remove vector data for specified user and collection
-- **Object Store Collection Deletion**: Remove object/document data for specified user and collection
-- **Triple Store Collection Deletion**: Remove graph/RDF data for specified user and collection
+Storage Management APIs (All Storage Processors):
+- **Create Collection**: Handle "create-collection" operation, establish collection in storage
+- **Delete Collection**: Handle "delete-collection" operation, remove all collection data
+- **Collection Exists Check**: Internal validation before accepting write operations

-Modified APIs:
-- **Data Loading APIs**: Enhanced to trigger collection use notification for lazy metadata creation
-- **Query APIs**: Enhanced to trigger collection use notification and optionally include metadata in responses
+Data Operation APIs (Modified Behavior):
+- **Write APIs**: Validate collection exists before accepting data, return error if not
+- **Query APIs**: Return empty results for non-existent collections without error

 ### Implementation Details
@@ -168,32 +233,35 @@ When a user initiates collection deletion through the librarian service:
 #### Collection Management Interface

-All store writers will implement a standardized collection management interface with a common schema across store types:
+All store writers implement a standardized collection management interface with a common schema:

-**Message Schema:**
+**Message Schema (`StorageManagementRequest`):**
 ```json
 {
-  "operation": "delete-collection",
+  "operation": "create-collection" | "delete-collection",
   "user": "user123",
-  "collection": "documents-2024",
-  "timestamp": "2024-01-15T10:30:00Z"
+  "collection": "documents-2024"
 }
 ```

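The message schema above maps naturally onto a small typed record. A hypothetical Python sketch (field names come from the spec's schema; the dataclass itself is illustrative, not the actual TrustGraph message type):

```python
from dataclasses import dataclass, asdict
import json

VALID_OPERATIONS = {"create-collection", "delete-collection"}

@dataclass
class StorageManagementRequest:
    """Sketch of the storage-management message from the schema above."""
    operation: str
    user: str
    collection: str

    def __post_init__(self):
        # Reject anything outside the two supported operations.
        if self.operation not in VALID_OPERATIONS:
            raise ValueError(f"unsupported operation: {self.operation}")

    def to_json(self):
        return json.dumps(asdict(self))
```

Validating the operation at construction time keeps malformed requests off the management queues entirely.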
 **Queue Architecture:**
-- **Object Store Collection Management Queue**: Handles collection operations for object/document stores
-- **Vector Store Collection Management Queue**: Handles collection operations for vector/embedding stores
-- **Triple Store Collection Management Queue**: Handles collection operations for graph/RDF stores
+- **Vector Store Management Queue** (`vector-storage-management`): Vector/embedding stores
+- **Object Store Management Queue** (`object-storage-management`): Object/document stores
+- **Triple Store Management Queue** (`triples-storage-management`): Graph/RDF stores
+- **Storage Response Queue** (`storage-management-response`): All responses sent here

 Each store writer implements:
-- **Collection Management Handler**: Separate from standard data storage handlers
-- **Delete Collection Operation**: Removes all data associated with the specified collection
-- **Message Processing**: Consumes from dedicated collection management queue
-- **Status Reporting**: Returns success/failure status for coordination
-- **Idempotent Operations**: Handles cases where collection doesn't exist (no-op)
+- **Collection Management Handler**: Processes `StorageManagementRequest` messages
+- **Create Collection Operation**: Establishes collection in storage backend
+- **Delete Collection Operation**: Removes all data associated with collection
+- **Collection State Tracking**: Maintains knowledge of which collections exist
+- **Message Processing**: Consumes from dedicated management queue
+- **Status Reporting**: Returns success/failure via `StorageManagementResponse`
+- **Idempotent Operations**: Safe to call create/delete multiple times

-**Initial Implementation:**
-Only `delete-collection` operation will be implemented initially. The interface supports future operations like `archive-collection`, `migrate-collection`, etc.
+**Supported Operations:**
+- `create-collection`: Create collection in storage backend
+- `delete-collection`: Remove all collection data from storage backend

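The handler contract above — process both operations, report status, stay idempotent — can be sketched with an in-memory stand-in for backend state (class and key names are hypothetical, not the real store-writer code):

```python
class CollectionManagementHandler:
    """Sketch of an idempotent store-writer handler for the supported operations."""
    def __init__(self):
        self.collections = set()  # stands in for real backend state

    def handle(self, operation, user, collection):
        key = (user, collection)
        if operation == "create-collection":
            self.collections.add(key)       # no-op if already present
        elif operation == "delete-collection":
            self.collections.discard(key)   # no-op if absent
        else:
            return {"status": "error", "error": f"unknown operation: {operation}"}
        # Success/failure would be reported on the storage response queue.
        return {"status": "success"}
```

Idempotency is what makes retries safe: redelivering a create or delete message after a timeout cannot corrupt backend state.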
 #### Cassandra Triple Store Refactor

@@ -244,13 +312,11 @@ As part of this implementation, the Cassandra triple store will be refactored fr
 - Maintain same query logic with collection parameter

 **Benefits:**
-- **Simplified Collection Deletion**: Simple `DELETE FROM triples WHERE collection = ?` instead of dropping tables
+- **Simplified Collection Deletion**: Delete using `collection` partition key across all 4 tables
 - **Resource Efficiency**: Fewer database connections and table objects
 - **Cross-Collection Operations**: Easier to implement operations spanning multiple collections
 - **Consistent Architecture**: Aligns with unified collection metadata approach

-**Migration Strategy:**
-Existing table-per-collection data will need migration to the new unified schema during the upgrade process.
+- **Collection Validation**: Easy to check collection existence via `triples_collection` table

 Collection operations will be atomic where possible and provide appropriate error handling and validation.
@@ -264,37 +330,25 @@ Collection listing operations may need pagination for environments with large nu

## Testing Strategy

Comprehensive testing will cover collection lifecycle operations, metadata management, and CLI command functionality with both unit and integration tests.

## Migration Plan

This implementation requires both metadata and storage migrations:

### Collection Metadata Migration
Existing collections will need to be registered in the new Cassandra collections metadata table. A migration process will:
- Scan existing keyspaces and tables to identify collections
- Create metadata records with default values (name=collection_id, empty description/tags)
- Preserve creation timestamps where possible

### Cassandra Triple Store Migration
The Cassandra storage refactor requires data migration from table-per-collection to a unified table:
- **Pre-migration**: Identify all user keyspaces and collection tables
- **Data Transfer**: Copy triples from individual collection tables to the unified "triples" table, tagged with the collection
- **Schema Validation**: Ensure the new primary key structure maintains query performance
- **Cleanup**: Remove old collection tables after successful migration
- **Rollback Plan**: Maintain ability to restore the table-per-collection structure if needed

Migration will be performed during a maintenance window to ensure data consistency.

Comprehensive testing will cover:
- Collection creation workflow end-to-end
- Storage backend synchronization
- Write validation for non-existent collections
- Query handling of non-existent collections
- Collection deletion cascade across all stores
- Error handling and recovery scenarios
- Unit tests for each storage backend
- Integration tests for cross-store operations

## Implementation Status

### ✅ Completed Components

1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_service.py`)
   - Complete collection CRUD operations (list, update, delete)
1. **Librarian Collection Management Service** (`trustgraph-flow/trustgraph/librarian/collection_manager.py`)
   - Collection metadata CRUD operations (list, update, delete)
   - Cassandra collection metadata table integration via `LibraryTableStore`
   - Async request/response handling with proper error management
   - Collection deletion cascade coordination across all storage types
   - Async request/response handling with proper error management

2. **Collection Metadata Schema** (`trustgraph-base/trustgraph/schema/services/collection.py`)
   - `CollectionManagementRequest` and `CollectionManagementResponse` schemas

@@ -303,47 +357,70 @@ Migration will be performed during a maintenance window to ensure data consisten

3. **Storage Management Schema** (`trustgraph-base/trustgraph/schema/services/storage.py`)
   - `StorageManagementRequest` and `StorageManagementResponse` schemas
   - Storage management queue topics defined
   - Message format for storage-level collection operations

### ❌ Missing Components
4. **Cassandra 4-Table Schema** (`trustgraph-flow/trustgraph/direct/cassandra_kg.py`)
   - Compound partition keys for query performance
   - `triples_collection` table for SPO queries and deletion tracking
   - Collection deletion implemented with read-then-delete pattern
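The read-then-delete pattern deserves a brief sketch, since `DELETE ... WHERE collection = ?` alone cannot target tables whose partition key is compound (e.g. `(collection, s)`). The sketch below assumes three query tables named `triples_s`, `triples_p`, and `triples_o`, and a `triples_collection` table partitioned by collection alone; those names and the fake session are illustrative assumptions, not the shipped schema or driver code.

```python
# Sketch of read-then-delete collection deletion: scan the
# collection-partitioned tracking table for the (s, p, o) keys, then delete
# each triple from the compound-partition-key query tables.

class FakeSession:
    """Minimal stand-in for a cassandra-driver Session, recording CQL."""
    def __init__(self, triples):
        self.triples = triples        # rows "stored" in triples_collection
        self.statements = []
    def execute(self, cql, params=()):
        self.statements.append((cql, params))
        if cql.lstrip().startswith("SELECT"):
            return list(self.triples)
        return []

QUERY_TABLES = ["triples_s", "triples_p", "triples_o"]  # assumed names

def delete_collection(session, collection):
    # 1. Recover the triple keys from the tracking table, which is
    #    partitioned by collection alone and so supports this query.
    rows = session.execute(
        "SELECT s, p, o FROM triples_collection WHERE collection = %s",
        (collection,))
    # 2. Delete each triple from every compound-partition-key table,
    #    naming the full primary key in each DELETE.
    for s, p, o in rows:
        for table in QUERY_TABLES:
            session.execute(
                f"DELETE FROM {table} WHERE collection = %s AND s = %s "
                "AND p = %s AND o = %s",
                (collection, s, p, o))
    # 3. Drop the tracking partition in one statement.
    session.execute(
        "DELETE FROM triples_collection WHERE collection = %s",
        (collection,))

session = FakeSession([("d1", "title", "Intro"), ("d1", "author", "Ann")])
delete_collection(session, "research")
```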

1. **Storage Management Queue Topics**
   - Missing topic definitions in schema for:
     - `vector_storage_management_topic`
     - `object_storage_management_topic`
     - `triples_storage_management_topic`
     - `storage_management_response_topic`
   - These are referenced by the librarian service but not yet defined

### 🔄 In Progress Components

2. **Store Collection Management Handlers**
   - **Vector Store Writers** (Qdrant, Milvus, Pinecone): No collection deletion handlers
   - **Object Store Writers** (Cassandra): No collection deletion handlers
   - **Triple Store Writers** (Cassandra, Neo4j, Memgraph, FalkorDB): No collection deletion handlers
   - Need to implement `StorageManagementRequest` processing in each store writer

1. **Collection Creation Broadcast** (`trustgraph-flow/trustgraph/librarian/collection_manager.py`)
   - Update `update_collection()` to send "create-collection" to storage backends
   - Wait for confirmations from all storage processors
   - Handle creation failures appropriately
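The broadcast-and-confirm step can be sketched with `asyncio.gather`. This is a shape sketch under stated assumptions: the processor callables, and the decision to fail the whole broadcast when any backend fails, are illustrative choices rather than the shipped collection_manager behaviour.

```python
# Hedged sketch of the create-collection broadcast: send the request to all
# storage processors concurrently and wait for every confirmation.

import asyncio

async def broadcast_create(processors, user, collection):
    """Fan out create-collection to all processors; raise if any backend
    reports a failure so the caller can surface the error."""
    results = await asyncio.gather(
        *(p(user, collection) for p in processors),
        return_exceptions=True)
    failures = [r for r in results if isinstance(r, Exception)]
    if failures:
        raise RuntimeError(f"{len(failures)} backend(s) failed: {failures}")
    return len(results)

async def ok_backend(user, collection):
    # Stand-in for a storage processor confirming creation.
    return "ok"

confirmed = asyncio.run(broadcast_create([ok_backend, ok_backend], "u", "c"))
```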

3. **Collection Management Interface Implementation**
   - Store writers need collection management message consumers
   - Collection deletion operations need to be implemented per store type
   - Response handling back to librarian service

2. **Document Submission Handler** (`trustgraph-flow/trustgraph/librarian/service.py` or similar)
   - Check whether the collection exists when a document is submitted
   - If it does not exist: create the collection with defaults before processing the document
   - Trigger the same "create-collection" broadcast as `tg-set-collection`
   - Ensure the collection is established before the document flows to storage processors
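The submission-time guard described above can be sketched as follows. The `submit_document` function, the `FakeStore` class, and the default metadata values are hypothetical stand-ins for the librarian service, shown only to illustrate the check-then-create ordering.

```python
# Sketch of the document-submission path: create the target collection with
# defaults (name = collection id, empty description/tags) before the
# document reaches the storage processors.

def submit_document(doc, user, collection, store):
    if not store.collection_exists(user, collection):
        store.create_collection(user, collection,
                                name=collection, description="", tags=[])
    store.write(user, collection, doc)

class FakeStore:
    """In-memory stand-in for the collection-aware storage layer."""
    def __init__(self):
        self.collections = {}
        self.docs = []
    def collection_exists(self, user, collection):
        return (user, collection) in self.collections
    def create_collection(self, user, collection, **meta):
        self.collections[(user, collection)] = meta
    def write(self, user, collection, doc):
        self.docs.append((user, collection, doc))

store = FakeStore()
submit_document("doc-1", "alice", "papers", store)
submit_document("doc-2", "alice", "papers", store)  # collection reused, not recreated
```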

### ❌ Pending Components

1. **Collection State Tracking** - Need to implement in each storage backend:
   - **Cassandra Triples**: Use `triples_collection` table with marker triples
   - **Neo4j/Memgraph/FalkorDB**: Create `:CollectionMetadata` nodes
   - **Qdrant/Milvus/Pinecone**: Use native collection APIs
   - **Cassandra Objects**: Add collection metadata tracking
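For the Cassandra case, the marker-triple idea can be sketched as a reserved sentinel row in `triples_collection`. The sentinel values and the `TrackingStore` class here are assumptions chosen for illustration; the spec does not fix the marker's actual encoding.

```python
# Sketch of collection state tracking via a marker row in the
# triples_collection table: the marker records that the collection exists
# even before any real triples are written.

MARKER = ("__collection__", "__exists__", "__marker__")  # assumed sentinel

class TrackingStore:
    def __init__(self):
        self.rows = set()   # stands in for the triples_collection table
    def create_collection(self, collection):
        self.rows.add((collection,) + MARKER)
    def collection_exists(self, collection):
        # Any row (marker or data) in the collection's partition counts.
        return any(r[0] == collection for r in self.rows)
    def delete_collection(self, collection):
        self.rows = {r for r in self.rows if r[0] != collection}

store = TrackingStore()
store.create_collection("research")
exists_before = store.collection_exists("research")
store.delete_collection("research")
exists_after = store.collection_exists("research")
```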

2. **Storage Management Handlers** - Need "create-collection" support in 12 files:
   - `trustgraph-flow/trustgraph/storage/triples/cassandra/write.py`
   - `trustgraph-flow/trustgraph/storage/triples/neo4j/write.py`
   - `trustgraph-flow/trustgraph/storage/triples/memgraph/write.py`
   - `trustgraph-flow/trustgraph/storage/triples/falkordb/write.py`
   - `trustgraph-flow/trustgraph/storage/doc_embeddings/qdrant/write.py`
   - `trustgraph-flow/trustgraph/storage/graph_embeddings/qdrant/write.py`
   - `trustgraph-flow/trustgraph/storage/doc_embeddings/milvus/write.py`
   - `trustgraph-flow/trustgraph/storage/graph_embeddings/milvus/write.py`
   - `trustgraph-flow/trustgraph/storage/doc_embeddings/pinecone/write.py`
   - `trustgraph-flow/trustgraph/storage/graph_embeddings/pinecone/write.py`
   - `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py`
   - Plus any other storage implementations

3. **Write Operation Validation** - Add collection existence checks to all `store_*` methods

4. **Query Operation Handling** - Update queries to return empty results for non-existent collections
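Items 3 and 4 pair naturally: writes against a missing collection are rejected, while reads return empty results. A minimal sketch, matching the `Collection .* does not exist` expectation used in the updated tests below; the `ValidatingStore` class itself is illustrative.

```python
# Sketch of write validation and query handling for non-existent
# collections: store_* methods raise ValueError, queries return empty.

class ValidatingStore:
    def __init__(self, collections):
        self.collections = set(collections)
        self.data = {}
    def store_triples(self, collection, triples):
        if collection not in self.collections:
            raise ValueError(f"Collection '{collection}' does not exist")
        self.data.setdefault(collection, []).extend(triples)
    def query(self, collection):
        # Non-existent collections yield empty results, not errors.
        return self.data.get(collection, [])

store = ValidatingStore({"known"})
store.store_triples("known", [("s", "p", "o")])
empty_result = store.query("missing")
try:
    store.store_triples("missing", [("s", "p", "o")])
    write_rejected = False
except ValueError:
    write_rejected = True
```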

### Next Implementation Steps

1. **Define Storage Management Topics** in `trustgraph-base/trustgraph/schema/services/storage.py`
2. **Implement Collection Management Handlers** in each storage writer:
   - Add `StorageManagementRequest` consumers
   - Implement collection deletion operations
   - Add response producers for status reporting
3. **Test End-to-End Collection Deletion** across all storage types

**Phase 1: Core Infrastructure (2-3 days)**
1. Add collection state tracking methods to all storage backends
2. Implement `collection_exists()` and `create_collection()` methods

## Timeline

**Phase 2: Storage Handlers (1 week)**
3. Add "create-collection" handlers to all storage processors
4. Add write validation to reject non-existent collections
5. Update query handling for non-existent collections

Phase 1 (Storage Topics): 1-2 days
Phase 2 (Store Handlers): 1-2 weeks depending on number of storage backends
Phase 3 (Testing & Integration): 3-5 days

**Phase 3: Collection Manager (2-3 days)**
6. Update collection_manager to broadcast creates
7. Implement response tracking and error handling

## Open Questions

- Should collection deletion be soft or hard delete by default?
- What metadata fields should be required vs optional?
- Should we implement storage management handlers incrementally by store type?

**Phase 4: Testing (3-5 days)**
8. End-to-end testing of explicit creation workflow
9. Test all storage backends
10. Validate error handling and edge cases

@@ -29,23 +29,25 @@ class TestEndToEndConfigurationFlow:
'CASSANDRA_USERNAME': 'integration-user',
'CASSANDRA_PASSWORD': 'integration-pass'
}

mock_cluster_instance = MagicMock()
mock_session = MagicMock()
mock_cluster_instance.connect.return_value = mock_session
mock_cluster.return_value = mock_cluster_instance

with patch.dict(os.environ, env_vars, clear=True):
processor = TriplesWriter(taskgroup=MagicMock())

# Create a mock message to trigger TrustGraph creation
mock_message = MagicMock()
mock_message.metadata.user = 'test_user'
mock_message.metadata.collection = 'test_collection'
mock_message.triples = []

# This should create TrustGraph with environment config
await processor.store_triples(mock_message)

# Mock collection_exists to return True
with patch('trustgraph.direct.cassandra_kg.KnowledgeGraph.collection_exists', return_value=True):
# This should create TrustGraph with environment config
await processor.store_triples(mock_message)

# Verify Cluster was created with correct hosts
mock_cluster.assert_called_once()

@@ -145,8 +147,10 @@ class TestConfigurationPriorityEndToEnd:
mock_message.metadata.user = 'test_user'
mock_message.metadata.collection = 'test_collection'
mock_message.triples = []

await processor.store_triples(mock_message)

# Mock collection_exists to return True
with patch('trustgraph.direct.cassandra_kg.KnowledgeGraph.collection_exists', return_value=True):
await processor.store_triples(mock_message)

# Should use CLI parameters, not environment
mock_cluster.assert_called_once()

@@ -243,8 +247,10 @@ class TestNoBackwardCompatibilityEndToEnd:
mock_message.metadata.user = 'legacy_user'
mock_message.metadata.collection = 'legacy_collection'
mock_message.triples = []

await processor.store_triples(mock_message)

# Mock collection_exists to return True
with patch('trustgraph.direct.cassandra_kg.KnowledgeGraph.collection_exists', return_value=True):
await processor.store_triples(mock_message)

# Should use defaults since old parameters are not recognized
mock_cluster.assert_called_once()

@@ -299,8 +305,10 @@ class TestNoBackwardCompatibilityEndToEnd:
mock_message.metadata.user = 'precedence_user'
mock_message.metadata.collection = 'precedence_collection'
mock_message.triples = []

await processor.store_triples(mock_message)

# Mock collection_exists to return True
with patch('trustgraph.direct.cassandra_kg.KnowledgeGraph.collection_exists', return_value=True):
await processor.store_triples(mock_message)

# Should use new parameters, not old ones
mock_cluster.assert_called_once()

@@ -349,8 +357,10 @@ class TestMultipleHostsHandling:
mock_message.metadata.user = 'single_user'
mock_message.metadata.collection = 'single_collection'
mock_message.triples = []

await processor.store_triples(mock_message)

# Mock collection_exists to return True
with patch('trustgraph.direct.cassandra_kg.KnowledgeGraph.collection_exists', return_value=True):
await processor.store_triples(mock_message)

# Single host should be converted to list
mock_cluster.assert_called_once()

@@ -22,7 +22,36 @@ class TestObjectsCassandraIntegration:
def mock_cassandra_session(self):
"""Mock Cassandra session for integration tests"""
session = MagicMock()
session.execute = MagicMock()

# Track if keyspaces have been created
created_keyspaces = set()

# Mock the execute method to return a valid result for keyspace checks
def execute_mock(query, *args, **kwargs):
result = MagicMock()
query_str = str(query)

# Track keyspace creation
if "CREATE KEYSPACE" in query_str:
# Extract keyspace name from query
import re
match = re.search(r'CREATE KEYSPACE IF NOT EXISTS (\w+)', query_str)
if match:
created_keyspaces.add(match.group(1))

# For keyspace existence checks
if "system_schema.keyspaces" in query_str:
# Check if this keyspace was created
if args and args[0] in created_keyspaces:
result.one.return_value = MagicMock()  # Exists
else:
result.one.return_value = None  # Doesn't exist
else:
result.one.return_value = None

return result

session.execute = MagicMock(side_effect=execute_mock)
return session

@pytest.fixture

@@ -57,7 +86,8 @@ class TestObjectsCassandraIntegration:
processor.convert_value = Processor.convert_value.__get__(processor, Processor)
processor.on_schema_config = Processor.on_schema_config.__get__(processor, Processor)
processor.on_object = Processor.on_object.__get__(processor, Processor)

processor.create_collection = Processor.create_collection.__get__(processor, Processor)

return processor, mock_cassandra_cluster, mock_cassandra_session

@pytest.mark.asyncio

@@ -85,7 +115,10 @@ class TestObjectsCassandraIntegration:
await processor.on_schema_config(config, version=1)
assert "customer_records" in processor.schemas

# Step 1.5: Create the collection first (simulate tg-set-collection)
await processor.create_collection("test_user", "import_2024")

# Step 2: Process an ExtractedObject
test_obj = ExtractedObject(
metadata=Metadata(

@@ -104,10 +137,10 @@ class TestObjectsCassandraIntegration:
confidence=0.95,
source_span="Customer: John Doe..."
)

msg = MagicMock()
msg.value.return_value = test_obj

await processor.on_object(msg, None, None)

# Verify Cassandra interactions

@@ -178,7 +211,11 @@ class TestObjectsCassandraIntegration:
await processor.on_schema_config(config, version=1)
assert len(processor.schemas) == 2

# Create collections first
await processor.create_collection("shop", "catalog")
await processor.create_collection("shop", "sales")

# Process objects for different schemas
product_obj = ExtractedObject(
metadata=Metadata(id="p1", user="shop", collection="catalog", metadata=[]),

@@ -187,7 +224,7 @@ class TestObjectsCassandraIntegration:
confidence=0.9,
source_span="Product..."
)

order_obj = ExtractedObject(
metadata=Metadata(id="o1", user="shop", collection="sales", metadata=[]),
schema_name="orders",

@@ -195,7 +232,7 @@ class TestObjectsCassandraIntegration:
confidence=0.85,
source_span="Order..."
)

# Process both objects
for obj in [product_obj, order_obj]:
msg = MagicMock()

@@ -225,6 +262,9 @@ class TestObjectsCassandraIntegration:
]
)

# Create collection first
await processor.create_collection("test", "test")

# Create object missing required field
test_obj = ExtractedObject(
metadata=Metadata(id="t1", user="test", collection="test", metadata=[]),

@@ -233,10 +273,10 @@ class TestObjectsCassandraIntegration:
confidence=0.8,
source_span="Test"
)

msg = MagicMock()
msg.value.return_value = test_obj

# Should still process (Cassandra doesn't enforce NOT NULL)
await processor.on_object(msg, None, None)

@@ -261,6 +301,9 @@ class TestObjectsCassandraIntegration:
]
)

# Create collection first
await processor.create_collection("logger", "app_events")

# Process object
test_obj = ExtractedObject(
metadata=Metadata(id="e1", user="logger", collection="app_events", metadata=[]),

@@ -269,10 +312,10 @@ class TestObjectsCassandraIntegration:
confidence=1.0,
source_span="Event"
)

msg = MagicMock()
msg.value.return_value = test_obj

await processor.on_object(msg, None, None)

# Verify synthetic_id was added

@@ -325,8 +368,10 @@ class TestObjectsCassandraIntegration:
)

# Make insert fail
mock_result = MagicMock()
mock_result.one.return_value = MagicMock()  # Keyspace exists
mock_session.execute.side_effect = [
None,  # keyspace creation succeeds
mock_result,  # keyspace existence check succeeds
None,  # table creation succeeds
Exception("Connection timeout")  # insert fails
]

@@ -359,7 +404,11 @@ class TestObjectsCassandraIntegration:

# Process objects from different collections
collections = ["import_jan", "import_feb", "import_mar"]

# Create all collections first
for coll in collections:
await processor.create_collection("analytics", coll)

for coll in collections:
obj = ExtractedObject(
metadata=Metadata(id=f"{coll}-1", user="analytics", collection=coll, metadata=[]),

@@ -368,7 +417,7 @@ class TestObjectsCassandraIntegration:
confidence=0.9,
source_span="Data"
)

msg = MagicMock()
msg.value.return_value = obj
await processor.on_object(msg, None, None)

@@ -436,9 +485,12 @@ class TestObjectsCassandraIntegration:
source_span="Multiple customers extracted from document"
)

# Create collection first
await processor.create_collection("test_user", "batch_import")

msg = MagicMock()
msg.value.return_value = batch_obj

await processor.on_object(msg, None, None)

# Verify table creation

@@ -479,6 +531,9 @@ class TestObjectsCassandraIntegration:
fields=[Field(name="id", type="string", size=50, primary=True)]
)

# Create collection first
await processor.create_collection("test", "empty")

# Process empty batch object
empty_obj = ExtractedObject(
metadata=Metadata(id="empty-1", user="test", collection="empty", metadata=[]),

@@ -487,10 +542,10 @@ class TestObjectsCassandraIntegration:
confidence=1.0,
source_span="No objects found"
)

msg = MagicMock()
msg.value.return_value = empty_obj

await processor.on_object(msg, None, None)

# Should still create table

@@ -517,6 +572,9 @@ class TestObjectsCassandraIntegration:
]
)

# Create collection first
await processor.create_collection("test", "mixed")

# Single object (backward compatibility)
single_obj = ExtractedObject(
metadata=Metadata(id="single", user="test", collection="mixed", metadata=[]),

@@ -525,7 +583,7 @@ class TestObjectsCassandraIntegration:
confidence=0.9,
source_span="Single object"
)

# Batch object
batch_obj = ExtractedObject(
metadata=Metadata(id="batch", user="test", collection="mixed", metadata=[]),

@@ -537,7 +595,7 @@ class TestObjectsCassandraIntegration:
confidence=0.85,
source_span="Batch objects"
)

# Process both
for obj in [single_obj, batch_obj]:
msg = MagicMock()

@@ -178,37 +178,24 @@ class TestPineconeDocEmbeddingsStorageProcessor:
assert calls[2][1]['vectors'][0]['metadata']['doc'] == "This is the second document chunk"

@pytest.mark.asyncio
async def test_store_document_embeddings_index_creation(self, processor):
"""Test automatic index creation when index doesn't exist"""
async def test_store_document_embeddings_index_validation(self, processor):
"""Test that writing to non-existent index raises ValueError"""
message = MagicMock()
message.metadata = MagicMock()
message.metadata.user = 'test_user'
message.metadata.collection = 'test_collection'

chunk = ChunkEmbeddings(
chunk=b"Test document content",
vectors=[[0.1, 0.2, 0.3]]
)
message.chunks = [chunk]

# Mock index doesn't exist initially

# Mock index doesn't exist
processor.pinecone.has_index.return_value = False
mock_index = MagicMock()
processor.pinecone.Index.return_value = mock_index

# Mock index creation
processor.pinecone.describe_index.return_value.status = {"ready": True}

with patch('uuid.uuid4', return_value='test-id'):

with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_document_embeddings(message)

# Verify index creation was called
expected_index_name = "d-test_user-test_collection"
processor.pinecone.create_index.assert_called_once()
create_call = processor.pinecone.create_index.call_args
assert create_call[1]['name'] == expected_index_name
assert create_call[1]['dimension'] == 3
assert create_call[1]['metric'] == "cosine"

@pytest.mark.asyncio
async def test_store_document_embeddings_empty_chunk(self, processor):

@@ -357,47 +344,44 @@ class TestPineconeDocEmbeddingsStorageProcessor:
mock_index.upsert.assert_not_called()

@pytest.mark.asyncio
async def test_store_document_embeddings_index_creation_failure(self, processor):
"""Test handling of index creation failure"""
async def test_store_document_embeddings_validation_before_creation(self, processor):
"""Test that validation error occurs before creation attempts"""
message = MagicMock()
message.metadata = MagicMock()
message.metadata.user = 'test_user'
message.metadata.collection = 'test_collection'

chunk = ChunkEmbeddings(
chunk=b"Test document content",
vectors=[[0.1, 0.2, 0.3]]
)
message.chunks = [chunk]

# Mock index doesn't exist and creation fails

# Mock index doesn't exist
processor.pinecone.has_index.return_value = False
processor.pinecone.create_index.side_effect = Exception("Index creation failed")

with pytest.raises(Exception, match="Index creation failed"):

with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_document_embeddings(message)

@pytest.mark.asyncio
async def test_store_document_embeddings_index_creation_timeout(self, processor):
"""Test handling of index creation timeout"""
async def test_store_document_embeddings_validates_before_timeout(self, processor):
"""Test that validation error occurs before timeout checks"""
message = MagicMock()
message.metadata = MagicMock()
message.metadata.user = 'test_user'
message.metadata.collection = 'test_collection'

chunk = ChunkEmbeddings(
chunk=b"Test document content",
vectors=[[0.1, 0.2, 0.3]]
)
message.chunks = [chunk]

# Mock index doesn't exist and never becomes ready

# Mock index doesn't exist
processor.pinecone.has_index.return_value = False
processor.pinecone.describe_index.return_value.status = {"ready": False}

with patch('time.sleep'):  # Speed up the test
with pytest.raises(RuntimeError, match="Gave up waiting for index creation"):
await processor.store_document_embeddings(message)

with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_document_embeddings(message)

@pytest.mark.asyncio
async def test_store_document_embeddings_unicode_content(self, processor):

@@ -43,8 +43,6 @@ class TestQdrantDocEmbeddingsStorage(IsolatedAsyncioTestCase):
# Verify processor attributes
assert hasattr(processor, 'qdrant')
assert processor.qdrant == mock_qdrant_instance
assert hasattr(processor, 'last_collection')
assert processor.last_collection is None

@patch('trustgraph.storage.doc_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.DocumentEmbeddingsStoreService.__init__')

@@ -245,8 +243,9 @@ class TestQdrantDocEmbeddingsStorage(IsolatedAsyncioTestCase):
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = True  # Collection exists
mock_qdrant_client.return_value = mock_qdrant_instance

config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -255,36 +254,37 @@ class TestQdrantDocEmbeddingsStorage(IsolatedAsyncioTestCase):
}

processor = Processor(**config)

# Create mock message with empty chunk
mock_message = MagicMock()
mock_message.metadata.user = 'empty_user'
mock_message.metadata.collection = 'empty_collection'

mock_chunk_empty = MagicMock()
mock_chunk_empty.chunk.decode.return_value = ""  # Empty string
mock_chunk_empty.vectors = [[0.1, 0.2]]

mock_message.chunks = [mock_chunk_empty]

# Act
await processor.store_document_embeddings(mock_message)

# Assert
# Should not call upsert for empty chunks
mock_qdrant_instance.upsert.assert_not_called()
mock_qdrant_instance.collection_exists.assert_not_called()
# But collection_exists should be called for validation
mock_qdrant_instance.collection_exists.assert_called_once()

@patch('trustgraph.storage.doc_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.DocumentEmbeddingsStoreService.__init__')
async def test_collection_creation_when_not_exists(self, mock_base_init, mock_qdrant_client):
"""Test collection creation when it doesn't exist"""
"""Test that writing to non-existent collection raises ValueError"""
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = False  # Collection doesn't exist
mock_qdrant_client.return_value = mock_qdrant_instance

config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -293,46 +293,32 @@ class TestQdrantDocEmbeddingsStorage(IsolatedAsyncioTestCase):
}

processor = Processor(**config)

# Create mock message
mock_message = MagicMock()
mock_message.metadata.user = 'new_user'
mock_message.metadata.collection = 'new_collection'

mock_chunk = MagicMock()
mock_chunk.chunk.decode.return_value = 'test chunk'
mock_chunk.vectors = [[0.1, 0.2, 0.3, 0.4, 0.5]]  # 5 dimensions

mock_message.chunks = [mock_chunk]

# Act
await processor.store_document_embeddings(mock_message)

# Assert
expected_collection = 'd_new_user_new_collection'

# Verify collection existence check and creation
mock_qdrant_instance.collection_exists.assert_called_once_with(expected_collection)
mock_qdrant_instance.create_collection.assert_called_once()

# Verify create_collection was called with correct parameters
create_call_args = mock_qdrant_instance.create_collection.call_args
assert create_call_args[1]['collection_name'] == expected_collection

# Verify upsert was still called after collection creation
mock_qdrant_instance.upsert.assert_called_once()
mock_message.chunks = [mock_chunk]

# Act & Assert
with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_document_embeddings(mock_message)

@patch('trustgraph.storage.doc_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.DocumentEmbeddingsStoreService.__init__')
async def test_collection_creation_exception(self, mock_base_init, mock_qdrant_client):
"""Test collection creation handles exceptions"""
"""Test that validation error occurs before connection errors"""
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = False
mock_qdrant_instance.create_collection.side_effect = Exception("Qdrant connection failed")
mock_qdrant_instance.collection_exists.return_value = False  # Collection doesn't exist
mock_qdrant_client.return_value = mock_qdrant_instance

config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -341,32 +327,35 @@ class TestQdrantDocEmbeddingsStorage(IsolatedAsyncioTestCase):
}

processor = Processor(**config)

# Create mock message
mock_message = MagicMock()
mock_message.metadata.user = 'error_user'
mock_message.metadata.collection = 'error_collection'

mock_chunk = MagicMock()
mock_chunk.chunk.decode.return_value = 'test chunk'
mock_chunk.vectors = [[0.1, 0.2]]

mock_message.chunks = [mock_chunk]

# Act & Assert
with pytest.raises(Exception, match="Qdrant connection failed"):
with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_document_embeddings(mock_message)

@patch('trustgraph.storage.doc_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.DocumentEmbeddingsStoreService.__init__')
async def test_collection_caching_behavior(self, mock_base_init, mock_qdrant_client):
"""Test collection caching with last_collection"""
@patch('trustgraph.storage.doc_embeddings.qdrant.write.uuid')
async def test_collection_validation_on_write(self, mock_uuid, mock_base_init, mock_qdrant_client):
"""Test collection validation checks collection exists before writing"""
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = True
mock_qdrant_client.return_value = mock_qdrant_instance

mock_uuid.uuid4.return_value = MagicMock()
mock_uuid.uuid4.return_value.__str__ = MagicMock(return_value='test-uuid')

config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -375,46 +364,45 @@ class TestQdrantDocEmbeddingsStorage(IsolatedAsyncioTestCase):
}

processor = Processor(**config)

# Create first mock message
mock_message1 = MagicMock()
mock_message1.metadata.user = 'cache_user'
mock_message1.metadata.collection = 'cache_collection'

mock_chunk1 = MagicMock()
mock_chunk1.chunk.decode.return_value = 'first chunk'
mock_chunk1.vectors = [[0.1, 0.2, 0.3]]

mock_message1.chunks = [mock_chunk1]

# First call
await processor.store_document_embeddings(mock_message1)

# Reset mock to track second call
mock_qdrant_instance.reset_mock()

mock_qdrant_instance.collection_exists.return_value = True

# Create second mock message with same dimensions
mock_message2 = MagicMock()
mock_message2.metadata.user = 'cache_user'
mock_message2.metadata.collection = 'cache_collection'

mock_chunk2 = MagicMock()
mock_chunk2.chunk.decode.return_value = 'second chunk'
mock_chunk2.vectors = [[0.4, 0.5, 0.6]]  # Same dimension (3)

mock_message2.chunks = [mock_chunk2]

# Act - Second call with same collection
await processor.store_document_embeddings(mock_message2)

# Assert
expected_collection = 'd_cache_user_cache_collection'
assert processor.last_collection == expected_collection

# Verify second call skipped existence check (cached)
mock_qdrant_instance.collection_exists.assert_not_called()
mock_qdrant_instance.create_collection.assert_not_called()

# Verify collection existence is checked on each write
mock_qdrant_instance.collection_exists.assert_called_once_with(expected_collection)

# But upsert should still be called
mock_qdrant_instance.upsert.assert_called_once()

@@ -178,37 +178,24 @@ class TestPineconeGraphEmbeddingsStorageProcessor:
assert calls[2][1]['vectors'][0]['metadata']['entity'] == "entity2"
@pytest.mark.asyncio
async def test_store_graph_embeddings_index_creation(self, processor):
"""Test automatic index creation when index doesn't exist"""
async def test_store_graph_embeddings_index_validation(self, processor):
"""Test that writing to non-existent index raises ValueError"""
message = MagicMock()
message.metadata = MagicMock()
message.metadata.user = 'test_user'
message.metadata.collection = 'test_collection'
entity = EntityEmbeddings(
entity=Value(value="test_entity", is_uri=False),
vectors=[[0.1, 0.2, 0.3]]
)
message.entities = [entity]
# Mock index doesn't exist initially
# Mock index doesn't exist
processor.pinecone.has_index.return_value = False
mock_index = MagicMock()
processor.pinecone.Index.return_value = mock_index
# Mock index creation
processor.pinecone.describe_index.return_value.status = {"ready": True}
with patch('uuid.uuid4', return_value='test-id'):
with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_graph_embeddings(message)
# Verify index creation was called
expected_index_name = "t-test_user-test_collection"
processor.pinecone.create_index.assert_called_once()
create_call = processor.pinecone.create_index.call_args
assert create_call[1]['name'] == expected_index_name
assert create_call[1]['dimension'] == 3
assert create_call[1]['metric'] == "cosine"
@pytest.mark.asyncio
async def test_store_graph_embeddings_empty_entity_value(self, processor):

@@ -328,47 +315,44 @@ class TestPineconeGraphEmbeddingsStorageProcessor:
mock_index.upsert.assert_not_called()
@pytest.mark.asyncio
async def test_store_graph_embeddings_index_creation_failure(self, processor):
"""Test handling of index creation failure"""
async def test_store_graph_embeddings_validation_before_creation(self, processor):
"""Test that validation error occurs before any creation attempts"""
message = MagicMock()
message.metadata = MagicMock()
message.metadata.user = 'test_user'
message.metadata.collection = 'test_collection'
entity = EntityEmbeddings(
entity=Value(value="test_entity", is_uri=False),
vectors=[[0.1, 0.2, 0.3]]
)
message.entities = [entity]
# Mock index doesn't exist and creation fails
# Mock index doesn't exist
processor.pinecone.has_index.return_value = False
processor.pinecone.create_index.side_effect = Exception("Index creation failed")
with pytest.raises(Exception, match="Index creation failed"):
with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_graph_embeddings(message)
@pytest.mark.asyncio
async def test_store_graph_embeddings_index_creation_timeout(self, processor):
"""Test handling of index creation timeout"""
async def test_store_graph_embeddings_validates_before_timeout(self, processor):
"""Test that validation error occurs before timeout checks"""
message = MagicMock()
message.metadata = MagicMock()
message.metadata.user = 'test_user'
message.metadata.collection = 'test_collection'
entity = EntityEmbeddings(
entity=Value(value="test_entity", is_uri=False),
vectors=[[0.1, 0.2, 0.3]]
)
message.entities = [entity]
# Mock index doesn't exist and never becomes ready
# Mock index doesn't exist
processor.pinecone.has_index.return_value = False
processor.pinecone.describe_index.return_value.status = {"ready": False}
with patch('time.sleep'): # Speed up the test
with pytest.raises(RuntimeError, match="Gave up waiting for index creation"):
await processor.store_graph_embeddings(message)
with pytest.raises(ValueError, match="Collection .* does not exist"):
await processor.store_graph_embeddings(message)
def test_add_args_method(self):
"""Test that add_args properly configures argument parser"""

@@ -43,19 +43,17 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
# Verify processor attributes
assert hasattr(processor, 'qdrant')
assert processor.qdrant == mock_qdrant_instance
assert hasattr(processor, 'last_collection')
assert processor.last_collection is None
@patch('trustgraph.storage.graph_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.GraphEmbeddingsStoreService.__init__')
async def test_get_collection_creates_new_collection(self, mock_base_init, mock_qdrant_client):
"""Test get_collection creates a new collection when it doesn't exist"""
async def test_get_collection_validates_existence(self, mock_base_init, mock_qdrant_client):
"""Test get_collection validates that collection exists"""
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = False
mock_qdrant_client.return_value = mock_qdrant_instance
config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -64,22 +62,10 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
}
processor = Processor(**config)
# Act
collection_name = processor.get_collection(dim=512, user='test_user', collection='test_collection')
# Assert
expected_name = 't_test_user_test_collection'
assert collection_name == expected_name
assert processor.last_collection == expected_name
# Verify collection existence check and creation
mock_qdrant_instance.collection_exists.assert_called_once_with(expected_name)
mock_qdrant_instance.create_collection.assert_called_once()
# Verify create_collection was called with correct parameters
create_call_args = mock_qdrant_instance.create_collection.call_args
assert create_call_args[1]['collection_name'] == expected_name
# Act & Assert
with pytest.raises(ValueError, match="Collection .* does not exist"):
processor.get_collection(user='test_user', collection='test_collection')
@patch('trustgraph.storage.graph_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.storage.graph_embeddings.qdrant.write.uuid')

@@ -142,7 +128,7 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = True # Collection exists
mock_qdrant_client.return_value = mock_qdrant_instance
config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -151,15 +137,14 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
}
processor = Processor(**config)
# Act
collection_name = processor.get_collection(dim=256, user='existing_user', collection='existing_collection')
collection_name = processor.get_collection(user='existing_user', collection='existing_collection')
# Assert
expected_name = 't_existing_user_existing_collection'
assert collection_name == expected_name
assert processor.last_collection == expected_name
# Verify collection existence check was performed
mock_qdrant_instance.collection_exists.assert_called_once_with(expected_name)
# Verify create_collection was NOT called

@@ -167,14 +152,14 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
@patch('trustgraph.storage.graph_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.GraphEmbeddingsStoreService.__init__')
async def test_get_collection_caches_last_collection(self, mock_base_init, mock_qdrant_client):
"""Test get_collection skips checks when using same collection"""
async def test_get_collection_validates_on_each_call(self, mock_base_init, mock_qdrant_client):
"""Test get_collection validates collection existence on each call"""
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = True
mock_qdrant_client.return_value = mock_qdrant_instance
config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -183,36 +168,36 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
}
processor = Processor(**config)
# First call
collection_name1 = processor.get_collection(dim=128, user='cache_user', collection='cache_collection')
collection_name1 = processor.get_collection(user='cache_user', collection='cache_collection')
# Reset mock to track second call
mock_qdrant_instance.reset_mock()
mock_qdrant_instance.collection_exists.return_value = True
# Act - Second call with same parameters
collection_name2 = processor.get_collection(dim=128, user='cache_user', collection='cache_collection')
collection_name2 = processor.get_collection(user='cache_user', collection='cache_collection')
# Assert
expected_name = 't_cache_user_cache_collection'
assert collection_name1 == expected_name
assert collection_name2 == expected_name
# Verify second call skipped existence check (cached)
mock_qdrant_instance.collection_exists.assert_not_called()
# Verify collection existence check happens on each call
mock_qdrant_instance.collection_exists.assert_called_once_with(expected_name)
mock_qdrant_instance.create_collection.assert_not_called()
@patch('trustgraph.storage.graph_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.base.GraphEmbeddingsStoreService.__init__')
async def test_get_collection_creation_exception(self, mock_base_init, mock_qdrant_client):
"""Test get_collection handles collection creation exceptions"""
"""Test get_collection raises ValueError when collection doesn't exist"""
# Arrange
mock_base_init.return_value = None
mock_qdrant_instance = MagicMock()
mock_qdrant_instance.collection_exists.return_value = False
mock_qdrant_instance.create_collection.side_effect = Exception("Qdrant connection failed")
mock_qdrant_client.return_value = mock_qdrant_instance
config = {
'store_uri': 'http://localhost:6333',
'api_key': 'test-api-key',

@@ -221,10 +206,10 @@ class TestQdrantGraphEmbeddingsStorage(IsolatedAsyncioTestCase):
}
processor = Processor(**config)
# Act & Assert
with pytest.raises(Exception, match="Qdrant connection failed"):
processor.get_collection(dim=512, user='error_user', collection='error_collection')
with pytest.raises(ValueError, match="Collection .* does not exist"):
processor.get_collection(user='error_user', collection='error_collection')
@patch('trustgraph.storage.graph_embeddings.qdrant.write.QdrantClient')
@patch('trustgraph.storage.graph_embeddings.qdrant.write.uuid')

@@ -47,7 +47,7 @@ class TestMemgraphUserCollectionIsolation:
mock_graph_db.driver.return_value = mock_driver
mock_session = MagicMock()
mock_driver.session.return_value.__enter__.return_value = mock_session
# Mock execute_query response
mock_result = MagicMock()
mock_summary = MagicMock()

@@ -55,28 +55,30 @@ class TestMemgraphUserCollectionIsolation:
mock_summary.result_available_after = 10
mock_result.summary = mock_summary
mock_driver.execute_query.return_value = mock_result
processor = Processor(taskgroup=MagicMock())
# Create mock triple with URI object
triple = MagicMock()
triple.s.value = "http://example.com/subject"
triple.p.value = "http://example.com/predicate"
triple.o.value = "http://example.com/object"
triple.o.is_uri = True
# Create mock message with metadata
mock_message = MagicMock()
mock_message.triples = [triple]
mock_message.metadata.user = "test_user"
mock_message.metadata.collection = "test_collection"
await processor.store_triples(mock_message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(mock_message)
# Verify user/collection parameters were passed to all operations
# Should have: create_node (subject), create_node (object), relate_node = 3 calls
assert mock_driver.execute_query.call_count == 3
# Check that user and collection were included in all calls
for call in mock_driver.execute_query.call_args_list:
call_kwargs = call.kwargs if hasattr(call, 'kwargs') else call[1]

@@ -93,7 +95,7 @@ class TestMemgraphUserCollectionIsolation:
mock_graph_db.driver.return_value = mock_driver
mock_session = MagicMock()
mock_driver.session.return_value.__enter__.return_value = mock_session
# Mock execute_query response
mock_result = MagicMock()
mock_summary = MagicMock()

@@ -101,24 +103,26 @@ class TestMemgraphUserCollectionIsolation:
mock_summary.result_available_after = 10
mock_result.summary = mock_summary
mock_driver.execute_query.return_value = mock_result
processor = Processor(taskgroup=MagicMock())
# Create mock triple
triple = MagicMock()
triple.s.value = "http://example.com/subject"
triple.p.value = "http://example.com/predicate"
triple.o.value = "literal_value"
triple.o.is_uri = False
# Create mock message without user/collection metadata
mock_message = MagicMock()
mock_message.triples = [triple]
mock_message.metadata.user = None
mock_message.metadata.collection = None
await processor.store_triples(mock_message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(mock_message)
# Verify defaults were used
for call in mock_driver.execute_query.call_args_list:
call_kwargs = call.kwargs if hasattr(call, 'kwargs') else call[1]

@@ -295,7 +299,7 @@ class TestMemgraphUserCollectionRegression:
mock_graph_db.driver.return_value = mock_driver
mock_session = MagicMock()
mock_driver.session.return_value.__enter__.return_value = mock_session
# Mock execute_query response
mock_result = MagicMock()
mock_summary = MagicMock()

@@ -303,23 +307,25 @@ class TestMemgraphUserCollectionRegression:
mock_summary.result_available_after = 10
mock_result.summary = mock_summary
mock_driver.execute_query.return_value = mock_result
processor = Processor(taskgroup=MagicMock())
# Store data for user1
triple = MagicMock()
triple.s.value = "http://example.com/subject"
triple.p.value = "http://example.com/predicate"
triple.o.value = "user1_data"
triple.o.is_uri = False
message_user1 = MagicMock()
message_user1.triples = [triple]
message_user1.metadata.user = "user1"
message_user1.metadata.collection = "collection1"
await processor.store_triples(message_user1)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message_user1)
# Verify that all storage operations included user1/collection1 parameters
for call in mock_driver.execute_query.call_args_list:
call_kwargs = call.kwargs if hasattr(call, 'kwargs') else call[1]

@@ -75,8 +75,10 @@ class TestNeo4jUserCollectionIsolation:
mock_summary.counters.nodes_created = 1
mock_summary.result_available_after = 10
mock_driver.execute_query.return_value.summary = mock_summary
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify nodes and relationships were created with user/collection properties
expected_calls = [

@@ -141,8 +143,10 @@ class TestNeo4jUserCollectionIsolation:
mock_summary.counters.nodes_created = 1
mock_summary.result_available_after = 10
mock_driver.execute_query.return_value.summary = mock_summary
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify defaults were used
mock_driver.execute_query.assert_any_call(

@@ -273,10 +277,12 @@ class TestNeo4jUserCollectionIsolation:
mock_summary.counters.nodes_created = 1
mock_summary.result_available_after = 10
mock_driver.execute_query.return_value.summary = mock_summary
# Store data for both users
await processor.store_triples(message_user1)
await processor.store_triples(message_user2)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
# Store data for both users
await processor.store_triples(message_user1)
await processor.store_triples(message_user2)
# Verify user1 data was stored with user1/coll1
mock_driver.execute_query.assert_any_call(

@@ -446,9 +452,11 @@ class TestNeo4jUserCollectionRegression:
mock_summary.counters.nodes_created = 1
mock_summary.result_available_after = 10
mock_driver.execute_query.return_value.summary = mock_summary
await processor.store_triples(message_user1)
await processor.store_triples(message_user2)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message_user1)
await processor.store_triples(message_user2)
# Verify two separate nodes were created with same URI but different user/collection
user1_node_call = call(

@@ -251,6 +251,8 @@ class TestObjectsCassandraStorageLogic:
processor.convert_value = Processor.convert_value.__get__(processor, Processor)
processor.session = MagicMock()
processor.on_object = Processor.on_object.__get__(processor, Processor)
processor.known_keyspaces = {"test_user"} # Pre-populate to skip validation query
processor.known_tables = {"test_user": set()} # Pre-populate
# Create test object
test_obj = ExtractedObject(

@@ -291,18 +293,19 @@ class TestObjectsCassandraStorageLogic:
"""Test that secondary indexes are created for indexed fields"""
processor = MagicMock()
processor.schemas = {}
processor.known_keyspaces = set()
processor.known_tables = {}
processor.known_keyspaces = {"test_user"} # Pre-populate to skip validation query
processor.known_tables = {"test_user": set()} # Pre-populate
processor.session = MagicMock()
processor.sanitize_name = Processor.sanitize_name.__get__(processor, Processor)
processor.sanitize_table = Processor.sanitize_table.__get__(processor, Processor)
processor.get_cassandra_type = Processor.get_cassandra_type.__get__(processor, Processor)
def mock_ensure_keyspace(keyspace):
processor.known_keyspaces.add(keyspace)
processor.known_tables[keyspace] = set()
if keyspace not in processor.known_tables:
processor.known_tables[keyspace] = set()
processor.ensure_keyspace = mock_ensure_keyspace
processor.ensure_table = Processor.ensure_table.__get__(processor, Processor)
# Create schema with indexed field
schema = RowSchema(
name="products",

@@ -313,10 +316,10 @@ class TestObjectsCassandraStorageLogic:
Field(name="price", type="float", size=8, indexed=True)
]
)
# Call ensure_table
processor.ensure_table("test_user", "products", schema)
# Should have 3 calls: create table + 2 indexes
assert processor.session.execute.call_count == 3

@@ -346,9 +349,10 @@ class TestObjectsCassandraStorageBatchLogic:
]
)
}
processor.known_keyspaces = {"test_user"} # Pre-populate to skip validation query
processor.ensure_table = MagicMock()
processor.sanitize_name = Processor.sanitize_name.__get__(processor, Processor)
processor.sanitize_table = Processor.sanitize_table.__get__(processor, Processor)
processor.sanitize_table = Processor.sanitize_table.__get__(processor, Processor)
processor.convert_value = Processor.convert_value.__get__(processor, Processor)
processor.session = MagicMock()
processor.on_object = Processor.on_object.__get__(processor, Processor)

@@ -415,6 +419,8 @@ class TestObjectsCassandraStorageBatchLogic:
processor.convert_value = Processor.convert_value.__get__(processor, Processor)
processor.session = MagicMock()
processor.on_object = Processor.on_object.__get__(processor, Processor)
processor.known_keyspaces = {"test_user"} # Pre-populate to skip validation query
processor.known_tables = {"test_user": set()} # Pre-populate
# Create empty batch object
empty_batch_obj = ExtractedObject(

@@ -461,6 +467,8 @@ class TestObjectsCassandraStorageBatchLogic:
processor.convert_value = Processor.convert_value.__get__(processor, Processor)
processor.session = MagicMock()
processor.on_object = Processor.on_object.__get__(processor, Processor)
processor.known_keyspaces = {"test_user"} # Pre-populate to skip validation query
processor.known_tables = {"test_user": set()} # Pre-populate
# Create single-item batch object (backward compatibility case)
single_batch_obj = ExtractedObject(

@@ -194,7 +194,13 @@ class TestFalkorDBStorageProcessor:
mock_result.run_time_ms = 10
processor.io.query.return_value = mock_result
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify queries were called in the correct order
expected_calls = [

@@ -225,7 +231,13 @@ class TestFalkorDBStorageProcessor:
mock_result.run_time_ms = 10
processor.io.query.return_value = mock_result
await processor.store_triples(mock_message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(mock_message)
# Verify queries were called in the correct order
expected_calls = [

@@ -273,7 +285,13 @@ class TestFalkorDBStorageProcessor:
mock_result.run_time_ms = 10
processor.io.query.return_value = mock_result
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify total number of queries (3 per triple)
assert processor.io.query.call_count == 6

@@ -299,7 +317,13 @@ class TestFalkorDBStorageProcessor:
message.metadata.collection = 'test_collection'
message.triples = []
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify no queries were made
processor.io.query.assert_not_called()

@@ -329,7 +353,13 @@ class TestFalkorDBStorageProcessor:
mock_result.run_time_ms = 10
processor.io.query.return_value = mock_result
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify total number of queries (3 per triple)
assert processor.io.query.call_count == 6

@@ -308,7 +308,13 @@ class TestMemgraphStorageProcessor:
# Reset the mock to clear initialization calls
processor.io.execute_query.reset_mock()
await processor.store_triples(mock_message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(mock_message)
# Verify execute_query was called for create_node, create_literal, and relate_literal
# (since mock_message has a literal object)

@@ -352,7 +358,13 @@ class TestMemgraphStorageProcessor:
)
message.triples = [triple1, triple2]
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify execute_query was called:
# Triple1: create_node(s) + create_literal(o) + relate_literal = 3 calls

@@ -381,7 +393,13 @@ class TestMemgraphStorageProcessor:
message.metadata.collection = 'test_collection'
message.triples = []
await processor.store_triples(message)
# Mock collection_exists to bypass validation in unit tests
with patch.object(processor, 'collection_exists', return_value=True):
await processor.store_triples(message)
# Verify no session calls were made (no triples to process)
processor.io.session.assert_not_called()

@ -268,7 +268,9 @@ class TestNeo4jStorageProcessor:
|
|||
mock_message.metadata.user = "test_user"
|
||||
mock_message.metadata.collection = "test_collection"
|
||||
|
||||
await processor.store_triples(mock_message)
|
||||
```diff
+        # Mock collection_exists to bypass validation in unit tests
+        with patch.object(processor, 'collection_exists', return_value=True):
+            await processor.store_triples(mock_message)

         # Verify create_node was called for subject and object
         # Verify relate_node was called

@@ -336,7 +338,9 @@ class TestNeo4jStorageProcessor:
         mock_message.metadata.user = "test_user"
         mock_message.metadata.collection = "test_collection"

-        await processor.store_triples(mock_message)
+        # Mock collection_exists to bypass validation in unit tests
+        with patch.object(processor, 'collection_exists', return_value=True):
+            await processor.store_triples(mock_message)

         # Verify create_node was called for subject
         # Verify create_literal was called for object

@@ -411,7 +415,9 @@ class TestNeo4jStorageProcessor:
         mock_message.metadata.user = "test_user"
         mock_message.metadata.collection = "test_collection"

-        await processor.store_triples(mock_message)
+        # Mock collection_exists to bypass validation in unit tests
+        with patch.object(processor, 'collection_exists', return_value=True):
+            await processor.store_triples(mock_message)

         # Should have processed both triples
         # Triple1: 2 nodes + 1 relationship = 3 calls

@@ -437,7 +443,9 @@ class TestNeo4jStorageProcessor:
         mock_message.metadata.user = "test_user"
         mock_message.metadata.collection = "test_collection"

-        await processor.store_triples(mock_message)
+        # Mock collection_exists to bypass validation in unit tests
+        with patch.object(processor, 'collection_exists', return_value=True):
+            await processor.store_triples(mock_message)

         # Should not have made any execute_query calls beyond index creation
         # Only index creation calls should have been made during initialization

@@ -552,7 +560,9 @@ class TestNeo4jStorageProcessor:
         mock_message.metadata.user = "test_user"
         mock_message.metadata.collection = "test_collection"

-        await processor.store_triples(mock_message)
+        # Mock collection_exists to bypass validation in unit tests
+        with patch.object(processor, 'collection_exists', return_value=True):
+            await processor.store_triples(mock_message)

         # Verify the triple was processed with special characters preserved
         mock_driver.execute_query.assert_any_call(
```
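Every test hunk above applies the same pattern: wrap the call under test in `unittest.mock.patch.object` so the new `collection_exists` validation is bypassed without touching a real database. A minimal, self-contained sketch of that pattern (the `Processor` class here is an illustrative stand-in, not the real Neo4j processor):

```python
import asyncio
from unittest.mock import patch

class Processor:
    """Stand-in for a storage processor that validates collections (hypothetical)."""

    def collection_exists(self, user, collection):
        raise RuntimeError("would hit the database")  # never called once patched

    async def store_triples(self, message):
        # The new validation step the tests must bypass
        if not self.collection_exists("user", "collection"):
            raise ValueError("collection does not exist")
        return "stored"

async def run_test():
    processor = Processor()
    # Replace the instance method with a mock returning True, as the updated tests do
    with patch.object(processor, 'collection_exists', return_value=True):
        return await processor.store_triples(None)

result = asyncio.run(run_test())
print(result)  # stored
```

Because `patch.object` restores the original attribute when the `with` block exits, each test remains isolated even though the processor instance is shared across assertions.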
```diff
@@ -24,16 +24,12 @@ class KnowledgeGraph:
         self.keyspace = keyspace
         self.username = username

-        # Multi-table schema design for optimal performance
-        self.use_legacy = os.getenv('CASSANDRA_USE_LEGACY', 'false').lower() == 'true'
-
-        if self.use_legacy:
-            self.table = "triples"  # Legacy single table
-        else:
-            # New optimized tables
-            self.subject_table = "triples_s"
-            self.po_table = "triples_p"
-            self.object_table = "triples_o"
+        # Optimized multi-table schema with collection deletion support
+        self.subject_table = "triples_s"
+        self.po_table = "triples_p"
+        self.object_table = "triples_o"
+        self.collection_table = "triples_collection"  # For SPO queries and deletion
+        self.collection_metadata_table = "collection_metadata"  # For tracking which collections exist

         if username and password:
             ssl_context = SSLContext(PROTOCOL_TLSv1_2)

@@ -47,9 +43,7 @@ class KnowledgeGraph:
         _active_clusters.append(self.cluster)

         self.init()

-        if not self.use_legacy:
-            self.prepare_statements()
+        self.prepare_statements()

     def clear(self):

@@ -70,42 +64,13 @@ class KnowledgeGraph:
         """);

         self.session.set_keyspace(self.keyspace)
+        self.init_optimized_schema()

-        if self.use_legacy:
-            self.init_legacy_schema()
-        else:
-            self.init_optimized_schema()
-
-    def init_legacy_schema(self):
-        """Initialize legacy single-table schema for backward compatibility"""
-        self.session.execute(f"""
-            create table if not exists {self.table} (
-                collection text,
-                s text,
-                p text,
-                o text,
-                PRIMARY KEY (collection, s, p, o)
-            );
-        """);
-
-        self.session.execute(f"""
-            create index if not exists {self.table}_s
-            ON {self.table} (s);
-        """);
-
-        self.session.execute(f"""
-            create index if not exists {self.table}_p
-            ON {self.table} (p);
-        """);
-
-        self.session.execute(f"""
-            create index if not exists {self.table}_o
-            ON {self.table} (o);
-        """);

     def init_optimized_schema(self):
         """Initialize optimized multi-table schema for performance"""
-        # Table 1: Subject-centric queries (get_s, get_sp, get_spo, get_os)
+        # Table 1: Subject-centric queries (get_s, get_sp, get_os)
         # Compound partition key for optimal data distribution
         self.session.execute(f"""
             CREATE TABLE IF NOT EXISTS {self.subject_table} (
                 collection text,

@@ -117,6 +82,7 @@ class KnowledgeGraph:
         """);

         # Table 2: Predicate-Object queries (get_p, get_po) - eliminates ALLOW FILTERING!
+        # Compound partition key for optimal data distribution
         self.session.execute(f"""
             CREATE TABLE IF NOT EXISTS {self.po_table} (
                 collection text,

@@ -128,6 +94,7 @@ class KnowledgeGraph:
         """);

         # Table 3: Object-centric queries (get_o)
+        # Compound partition key for optimal data distribution
         self.session.execute(f"""
             CREATE TABLE IF NOT EXISTS {self.object_table} (
                 collection text,

@@ -138,7 +105,29 @@ class KnowledgeGraph:
             );
         """);

-        logger.info("Optimized multi-table schema initialized")
+        # Table 4: Collection management and SPO queries (get_spo)
+        # Simple partition key enables efficient collection deletion
+        self.session.execute(f"""
+            CREATE TABLE IF NOT EXISTS {self.collection_table} (
+                collection text,
+                s text,
+                p text,
+                o text,
+                PRIMARY KEY (collection, s, p, o)
+            );
+        """);
+
+        # Table 5: Collection metadata tracking
+        # Tracks which collections exist without polluting triple data
+        self.session.execute(f"""
+            CREATE TABLE IF NOT EXISTS {self.collection_metadata_table} (
+                collection text,
+                created_at timestamp,
+                PRIMARY KEY (collection)
+            );
+        """);
+
+        logger.info("Optimized multi-table schema initialized (5 tables)")

     def prepare_statements(self):
         """Prepare statements for optimal performance"""

@@ -155,6 +144,10 @@ class KnowledgeGraph:
             f"INSERT INTO {self.object_table} (collection, o, s, p) VALUES (?, ?, ?, ?)"
         )

+        self.insert_collection_stmt = self.session.prepare(
+            f"INSERT INTO {self.collection_table} (collection, s, p, o) VALUES (?, ?, ?, ?)"
+        )
+
        # Query statements for optimized access
        self.get_all_stmt = self.session.prepare(
            f"SELECT s, p, o FROM {self.subject_table} WHERE collection = ? LIMIT ? ALLOW FILTERING"
```
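The schema and prepared statements above mean each logical triple is written four times, with the column order permuted so the leading columns match each table's partition key. A minimal sketch of that permutation (table names are taken from the diff; the function itself is illustrative, not part of the codebase):

```python
def fan_out(collection, s, p, o):
    """Return the four (table, row) writes for one triple.

    Column order matches each table's partition/clustering key layout,
    so every table can be queried by its own access pattern.
    """
    return [
        ("triples_s", (collection, s, p, o)),           # partition: (collection, s)
        ("triples_p", (collection, p, o, s)),           # partition: (collection, p)
        ("triples_o", (collection, o, s, p)),           # partition: (collection, o)
        ("triples_collection", (collection, s, p, o)),  # partition: collection only
    ]

writes = fan_out("docs", "s1", "knows", "o1")
print(len(writes))  # 4
```

The write amplification (4x) is the deliberate trade-off: each read pattern gets a table whose partition key answers it directly, with no secondary index and no ALLOW FILTERING.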
```diff
@@ -186,158 +179,168 @@ class KnowledgeGraph:
         )

         self.get_spo_stmt = self.session.prepare(
-            f"SELECT s as x FROM {self.subject_table} WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT ?"
+            f"SELECT s as x FROM {self.collection_table} WHERE collection = ? AND s = ? AND p = ? AND o = ? LIMIT ?"
         )

-        logger.info("Prepared statements initialized for optimal performance")
+        logger.info("Prepared statements initialized for optimal performance (4 tables)")

     def insert(self, collection, s, p, o):
-        if self.use_legacy:
-            self.session.execute(
-                f"insert into {self.table} (collection, s, p, o) values (%s, %s, %s, %s)",
-                (collection, s, p, o)
-            )
-        else:
-            # Batch write to all three tables for consistency
-            batch = BatchStatement()
-
-            # Insert into subject table
-            batch.add(self.insert_subject_stmt, (collection, s, p, o))
-
-            # Insert into predicate-object table (column order: collection, p, o, s)
-            batch.add(self.insert_po_stmt, (collection, p, o, s))
-
-            # Insert into object table (column order: collection, o, s, p)
-            batch.add(self.insert_object_stmt, (collection, o, s, p))
-
-            self.session.execute(batch)
+        # Batch write to all four tables for consistency
+        batch = BatchStatement()
+
+        # Insert into subject table
+        batch.add(self.insert_subject_stmt, (collection, s, p, o))
+
+        # Insert into predicate-object table (column order: collection, p, o, s)
+        batch.add(self.insert_po_stmt, (collection, p, o, s))
+
+        # Insert into object table (column order: collection, o, s, p)
+        batch.add(self.insert_object_stmt, (collection, o, s, p))
+
+        # Insert into collection table for SPO queries and deletion tracking
+        batch.add(self.insert_collection_stmt, (collection, s, p, o))
+
+        self.session.execute(batch)

     def get_all(self, collection, limit=50):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select s, p, o from {self.table} where collection = %s limit {limit}",
-                (collection,)
-            )
-        else:
-            # Use subject table for get_all queries
-            return self.session.execute(
-                self.get_all_stmt,
-                (collection, limit)
-            )
+        # Use subject table for get_all queries
+        return self.session.execute(
+            self.get_all_stmt,
+            (collection, limit)
+        )

     def get_s(self, collection, s, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select p, o from {self.table} where collection = %s and s = %s limit {limit}",
-                (collection, s)
-            )
-        else:
-            # Optimized: Direct partition access with (collection, s)
-            return self.session.execute(
-                self.get_s_stmt,
-                (collection, s, limit)
-            )
+        # Optimized: Direct partition access with (collection, s)
+        return self.session.execute(
+            self.get_s_stmt,
+            (collection, s, limit)
+        )

     def get_p(self, collection, p, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select s, o from {self.table} where collection = %s and p = %s limit {limit}",
-                (collection, p)
-            )
-        else:
-            # Optimized: Use po_table for direct partition access
-            return self.session.execute(
-                self.get_p_stmt,
-                (collection, p, limit)
-            )
+        # Optimized: Use po_table for direct partition access
+        return self.session.execute(
+            self.get_p_stmt,
+            (collection, p, limit)
+        )

     def get_o(self, collection, o, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select s, p from {self.table} where collection = %s and o = %s limit {limit}",
-                (collection, o)
-            )
-        else:
-            # Optimized: Use object_table for direct partition access
-            return self.session.execute(
-                self.get_o_stmt,
-                (collection, o, limit)
-            )
+        # Optimized: Use object_table for direct partition access
+        return self.session.execute(
+            self.get_o_stmt,
+            (collection, o, limit)
+        )

     def get_sp(self, collection, s, p, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select o from {self.table} where collection = %s and s = %s and p = %s limit {limit}",
-                (collection, s, p)
-            )
-        else:
-            # Optimized: Use subject_table with clustering key access
-            return self.session.execute(
-                self.get_sp_stmt,
-                (collection, s, p, limit)
-            )
+        # Optimized: Use subject_table with clustering key access
+        return self.session.execute(
+            self.get_sp_stmt,
+            (collection, s, p, limit)
+        )

     def get_po(self, collection, p, o, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select s from {self.table} where collection = %s and p = %s and o = %s limit {limit} allow filtering",
-                (collection, p, o)
-            )
-        else:
-            # CRITICAL OPTIMIZATION: Use po_table - NO MORE ALLOW FILTERING!
-            return self.session.execute(
-                self.get_po_stmt,
-                (collection, p, o, limit)
-            )
+        # CRITICAL OPTIMIZATION: Use po_table - NO MORE ALLOW FILTERING!
+        return self.session.execute(
+            self.get_po_stmt,
+            (collection, p, o, limit)
+        )

     def get_os(self, collection, o, s, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"select p from {self.table} where collection = %s and o = %s and s = %s limit {limit} allow filtering",
-                (collection, o, s)
-            )
-        else:
-            # Optimized: Use subject_table with clustering access (no more ALLOW FILTERING)
-            return self.session.execute(
-                self.get_os_stmt,
-                (collection, s, o, limit)
-            )
+        # Optimized: Use subject_table with clustering access (no more ALLOW FILTERING)
+        return self.session.execute(
+            self.get_os_stmt,
+            (collection, s, o, limit)
+        )

     def get_spo(self, collection, s, p, o, limit=10):
-        if self.use_legacy:
-            return self.session.execute(
-                f"""select s as x from {self.table} where collection = %s and s = %s and p = %s and o = %s limit {limit}""",
-                (collection, s, p, o)
-            )
-        else:
-            # Optimized: Use subject_table for exact key lookup
-            return self.session.execute(
-                self.get_spo_stmt,
-                (collection, s, p, o, limit)
-            )
+        # Optimized: Use collection_table for exact key lookup
+        return self.session.execute(
+            self.get_spo_stmt,
+            (collection, s, p, o, limit)
+        )
+
+    def collection_exists(self, collection):
+        """Check if collection exists by querying collection_metadata table"""
+        try:
+            result = self.session.execute(
+                f"SELECT collection FROM {self.collection_metadata_table} WHERE collection = %s LIMIT 1",
+                (collection,)
+            )
+            return bool(list(result))
+        except Exception as e:
+            logger.error(f"Error checking collection existence: {e}")
+            return False
+
+    def create_collection(self, collection):
+        """Create collection by inserting metadata row"""
+        try:
+            import datetime
+            self.session.execute(
+                f"INSERT INTO {self.collection_metadata_table} (collection, created_at) VALUES (%s, %s)",
+                (collection, datetime.datetime.now())
+            )
+            logger.info(f"Created collection metadata for {collection}")
+        except Exception as e:
+            logger.error(f"Error creating collection: {e}")
+            raise e

     def delete_collection(self, collection):
-        """Delete all triples for a specific collection"""
-        if self.use_legacy:
-            self.session.execute(
-                f"delete from {self.table} where collection = %s",
-                (collection,)
-            )
-        else:
-            # Delete from all three tables
-            self.session.execute(
-                f"delete from {self.subject_table} where collection = %s",
-                (collection,)
-            )
-            self.session.execute(
-                f"delete from {self.po_table} where collection = %s",
-                (collection,)
-            )
-            self.session.execute(
-                f"delete from {self.object_table} where collection = %s",
-                (collection,)
-            )
+        """Delete all triples for a specific collection
+
+        Uses collection_table to enumerate all triples, then deletes from all 4 tables
+        using full partition keys for optimal performance with compound keys.
+        """
+        # Step 1: Read all triples from collection_table (single partition read)
+        rows = self.session.execute(
+            f"SELECT s, p, o FROM {self.collection_table} WHERE collection = %s",
+            (collection,)
+        )
+
+        # Step 2: Delete each triple from all 4 tables using full partition keys
+        # Batch deletions for efficiency
+        batch = BatchStatement()
+        count = 0
+
+        for row in rows:
+            s, p, o = row.s, row.p, row.o
+
+            # Delete from subject table (partition key: collection, s)
+            batch.add(SimpleStatement(
+                f"DELETE FROM {self.subject_table} WHERE collection = ? AND s = ? AND p = ? AND o = ?"
+            ), (collection, s, p, o))
+
+            # Delete from predicate-object table (partition key: collection, p)
+            batch.add(SimpleStatement(
+                f"DELETE FROM {self.po_table} WHERE collection = ? AND p = ? AND o = ? AND s = ?"
+            ), (collection, p, o, s))
+
+            # Delete from object table (partition key: collection, o)
+            batch.add(SimpleStatement(
+                f"DELETE FROM {self.object_table} WHERE collection = ? AND o = ? AND s = ? AND p = ?"
+            ), (collection, o, s, p))
+
+            # Delete from collection table (partition key: collection only)
+            batch.add(SimpleStatement(
+                f"DELETE FROM {self.collection_table} WHERE collection = ? AND s = ? AND p = ? AND o = ?"
+            ), (collection, s, p, o))
+
+            count += 1
+
+            # Execute batch every 100 triples to avoid oversized batches
+            if count % 100 == 0:
+                self.session.execute(batch)
+                batch = BatchStatement()
+
+        # Execute remaining deletions
+        if count % 100 != 0:
+            self.session.execute(batch)
+
+        # Step 3: Delete collection metadata
+        self.session.execute(
+            f"DELETE FROM {self.collection_metadata_table} WHERE collection = %s",
+            (collection,)
+        )
+
+        logger.info(f"Deleted {count} triples from collection {collection}")

     def close(self):
         """Close the Cassandra session and cluster connections properly"""
```
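The rewritten `delete_collection` works in three steps: enumerate the collection's triples from the single-partition `triples_collection` table, delete each one from all four tables by full key, then drop the metadata row. The same flow can be modelled with in-memory sets standing in for the Cassandra tables (a simplified, runnable sketch, not the driver code itself):

```python
# In-memory model of the four triple tables, each storing rows in its own column order.
tables = {
    "triples_s": set(), "triples_p": set(),
    "triples_o": set(), "triples_collection": set(),
}

def insert(collection, s, p, o):
    """Fan one triple out to all four tables, permuting columns per table."""
    tables["triples_s"].add((collection, s, p, o))
    tables["triples_p"].add((collection, p, o, s))
    tables["triples_o"].add((collection, o, s, p))
    tables["triples_collection"].add((collection, s, p, o))

def delete_collection(collection):
    """Enumerate from the collection table, then delete each triple everywhere."""
    # Step 1: single-partition read lists every triple in the collection
    rows = [r for r in tables["triples_collection"] if r[0] == collection]
    # Step 2: delete from all four tables using the full key for each
    for (c, s, p, o) in rows:
        tables["triples_s"].discard((c, s, p, o))
        tables["triples_p"].discard((c, p, o, s))
        tables["triples_o"].discard((c, o, s, p))
        tables["triples_collection"].discard((c, s, p, o))
    return len(rows)

insert("docs", "s1", "knows", "o1")
insert("docs", "s2", "knows", "o2")
insert("other", "s3", "knows", "o3")
deleted = delete_collection("docs")
print(deleted)  # 2
```

The enumerate-then-delete shape is what the compound partition keys force: `triples_s` cannot be cleared with a single `WHERE collection = ?` delete because `collection` alone is not its full partition key, so the `triples_collection` table exists precisely to make the enumeration a single partition scan.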
```diff
@@ -49,6 +49,22 @@ class DocVectors:
         self.next_reload = time.time() + self.reload_time
         logger.debug(f"Reload at {self.next_reload}")

+    def collection_exists(self, user, collection):
+        """Check if collection exists (dimension-independent check)"""
+        collection_name = make_safe_collection_name(user, collection, self.prefix)
+        return self.client.has_collection(collection_name)
+
+    def create_collection(self, user, collection, dimension=384):
+        """Create collection with default dimension"""
+        collection_name = make_safe_collection_name(user, collection, self.prefix)
+
+        if self.client.has_collection(collection_name):
+            logger.info(f"Collection {collection_name} already exists")
+            return
+
+        self.init_collection(dimension, user, collection)
+        logger.info(f"Created Milvus collection {collection_name} with dimension {dimension}")
+
     def init_collection(self, dimension, user, collection):

         collection_name = make_safe_collection_name(user, collection, self.prefix)

@@ -49,6 +49,22 @@ class EntityVectors:
         self.next_reload = time.time() + self.reload_time
         logger.debug(f"Reload at {self.next_reload}")

+    def collection_exists(self, user, collection):
+        """Check if collection exists (dimension-independent check)"""
+        collection_name = make_safe_collection_name(user, collection, self.prefix)
+        return self.client.has_collection(collection_name)
+
+    def create_collection(self, user, collection, dimension=384):
+        """Create collection with default dimension"""
+        collection_name = make_safe_collection_name(user, collection, self.prefix)
+
+        if self.client.has_collection(collection_name):
+            logger.info(f"Collection {collection_name} already exists")
+            return
+
+        self.init_collection(dimension, user, collection)
+        logger.info(f"Created Milvus collection {collection_name} with dimension {dimension}")
+
     def init_collection(self, dimension, user, collection):

         collection_name = make_safe_collection_name(user, collection, self.prefix)
```
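Both Milvus stores gain the same idempotent `create_collection`: probe for the collection first and return early if it already exists, so repeated create requests are harmless. A runnable sketch of that check-then-create shape against a stub client (the stub and its methods are invented for illustration):

```python
class StubClient:
    """Minimal stand-in for a vector-store client (hypothetical)."""

    def __init__(self):
        self.collections = {}

    def has_collection(self, name):
        return name in self.collections

    def create(self, name, dimension):
        self.collections[name] = dimension

def create_collection(client, name, dimension=384):
    """Idempotent create: a second call with the same name is a no-op."""
    if client.has_collection(name):
        return False  # already exists, nothing to do
    client.create(name, dimension)
    return True

client = StubClient()
first = create_collection(client, "d_user_docs")
second = create_collection(client, "d_user_docs")
print(first, second)  # True False
```

Idempotence matters here because the create-collection request is broadcast to several storage backends; any of them may receive a retry.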
```diff
@@ -60,7 +60,7 @@ class CollectionManager:

     async def ensure_collection_exists(self, user: str, collection: str):
         """
-        Ensure a collection exists, creating it if necessary (lazy creation)
+        Ensure a collection exists, creating it if necessary with broadcast to storage

         Args:
             user: User ID

@@ -74,7 +74,7 @@ class CollectionManager:
                 return

             # Create new collection with default metadata
-            logger.info(f"Creating new collection {user}/{collection}")
+            logger.info(f"Auto-creating collection {user}/{collection} from document submission")
             await self.table_store.create_collection(
                 user=user,
                 collection=collection,

@@ -83,10 +83,64 @@ class CollectionManager:
                 tags=set()
             )

+            # Broadcast collection creation to all storage backends
+            creation_key = (user, collection)
+            logger.info(f"Broadcasting create-collection for {creation_key}")
+
+            self.pending_deletions[creation_key] = {
+                "responses_pending": 3,  # vector, object, triples
+                "responses_received": [],
+                "all_successful": True,
+                "error_messages": [],
+                "deletion_complete": asyncio.Event()
+            }
+
+            storage_request = StorageManagementRequest(
+                operation="create-collection",
+                user=user,
+                collection=collection
+            )
+
+            # Send creation requests to all storage types
+            if self.vector_storage_producer:
+                await self.vector_storage_producer.send(storage_request)
+            if self.object_storage_producer:
+                await self.object_storage_producer.send(storage_request)
+            if self.triples_storage_producer:
+                await self.triples_storage_producer.send(storage_request)
+
+            # Wait for all storage creations to complete (with timeout)
+            creation_info = self.pending_deletions[creation_key]
+            try:
+                await asyncio.wait_for(
+                    creation_info["deletion_complete"].wait(),
+                    timeout=30.0  # 30 second timeout
+                )
+            except asyncio.TimeoutError:
+                logger.error(f"Timeout waiting for storage creation responses for {creation_key}")
+                creation_info["all_successful"] = False
+                creation_info["error_messages"].append("Timeout waiting for storage creation")
+
+            # Check if all creations succeeded
+            if not creation_info["all_successful"]:
+                error_msg = f"Storage creation failed: {'; '.join(creation_info['error_messages'])}"
+                logger.error(error_msg)
+
+                # Clean up metadata on failure
+                await self.table_store.delete_collection(user, collection)
+
+                # Clean up tracking
+                del self.pending_deletions[creation_key]
+
+                raise RuntimeError(error_msg)
+
+            # Clean up tracking
+            del self.pending_deletions[creation_key]
+            logger.info(f"Collection {creation_key} auto-created successfully in all storage backends")
+
         except Exception as e:
             logger.error(f"Error ensuring collection exists: {e}")
-            # Don't fail the operation if collection creation fails
-            # This maintains backward compatibility
+            raise e

     async def list_collections(self, request: CollectionManagementRequest) -> CollectionManagementResponse:
         """

@@ -154,6 +208,67 @@ class CollectionManager:
                 tags=tags
             )

+            # Broadcast collection creation to all storage backends
+            creation_key = (request.user, request.collection)
+            logger.info(f"Broadcasting create-collection for {creation_key}")
+
+            self.pending_deletions[creation_key] = {
+                "responses_pending": 3,  # vector, object, triples
+                "responses_received": [],
+                "all_successful": True,
+                "error_messages": [],
+                "deletion_complete": asyncio.Event()
+            }
+
+            storage_request = StorageManagementRequest(
+                operation="create-collection",
+                user=request.user,
+                collection=request.collection
+            )
+
+            # Send creation requests to all storage types
+            if self.vector_storage_producer:
+                await self.vector_storage_producer.send(storage_request)
+            if self.object_storage_producer:
+                await self.object_storage_producer.send(storage_request)
+            if self.triples_storage_producer:
+                await self.triples_storage_producer.send(storage_request)
+
+            # Wait for all storage creations to complete (with timeout)
+            creation_info = self.pending_deletions[creation_key]
+            try:
+                await asyncio.wait_for(
+                    creation_info["deletion_complete"].wait(),
+                    timeout=30.0  # 30 second timeout
+                )
+            except asyncio.TimeoutError:
+                logger.error(f"Timeout waiting for storage creation responses for {creation_key}")
+                creation_info["all_successful"] = False
+                creation_info["error_messages"].append("Timeout waiting for storage creation")
+
+            # Check if all creations succeeded
+            if not creation_info["all_successful"]:
+                error_msg = f"Storage creation failed: {'; '.join(creation_info['error_messages'])}"
+                logger.error(error_msg)
+
+                # Clean up metadata on failure
+                await self.table_store.delete_collection(request.user, request.collection)
+
+                # Clean up tracking
+                del self.pending_deletions[creation_key]
+
+                return CollectionManagementResponse(
+                    error=Error(
+                        type="storage_creation_error",
+                        message=error_msg
+                    ),
+                    timestamp=datetime.now().isoformat()
+                )
+
+            # Clean up tracking
+            del self.pending_deletions[creation_key]
+            logger.info(f"Collection {creation_key} created successfully in all storage backends")
+
             # Get the newly created collection for response
             created_collection = await self.table_store.get_collection(request.user, request.collection)
```
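The broadcast-and-wait coordination used in both `ensure_collection_exists` and the explicit create handler reduces to: record how many storage responses are pending, send the request to every backend, and block on an `asyncio.Event` with a timeout. A condensed, runnable model of that pattern, with in-process fake backends standing in for the message producers (all names here are illustrative):

```python
import asyncio

async def broadcast_and_wait(backends, timeout=1.0):
    """Send a request to every backend and wait until all have responded.

    `backends` is a list of booleans: each entry is the success/failure a
    fake backend will report.
    """
    state = {
        "responses_pending": len(backends),
        "all_successful": True,
        "done": asyncio.Event(),
    }

    async def backend_task(ok):
        await asyncio.sleep(0)              # simulate an async round-trip
        state["responses_pending"] -= 1
        state["all_successful"] &= ok
        if state["responses_pending"] == 0:
            state["done"].set()             # last response releases the waiter

    for ok in backends:
        asyncio.ensure_future(backend_task(ok))

    try:
        await asyncio.wait_for(state["done"].wait(), timeout=timeout)
    except asyncio.TimeoutError:
        state["all_successful"] = False     # treat a timeout as failure
    return state["all_successful"]

result = asyncio.run(broadcast_and_wait([True, True, True]))
print(result)  # True
```

The timeout branch is what keeps a silent backend from hanging the request forever; on failure the real code additionally rolls back the metadata row so a half-created collection never lingers.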
```diff
@@ -38,24 +38,10 @@ class Processor(DocumentEmbeddingsQueryService):
         )

         self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
-        self.last_collection = None
-
-    def ensure_collection_exists(self, collection, dim):
-        """Ensure collection exists, create if it doesn't"""
-        if collection != self.last_collection:
-            if not self.qdrant.collection_exists(collection):
-                try:
-                    self.qdrant.create_collection(
-                        collection_name=collection,
-                        vectors_config=VectorParams(
-                            size=dim, distance=Distance.COSINE
-                        ),
-                    )
-                    logger.info(f"Created collection: {collection}")
-                except Exception as e:
-                    logger.error(f"Qdrant collection creation failed: {e}")
-                    raise e
-            self.last_collection = collection
+
+    def collection_exists(self, collection):
+        """Check if collection exists (no implicit creation)"""
+        return self.qdrant.collection_exists(collection)

     async def query_document_embeddings(self, msg):

@@ -63,15 +49,16 @@ class Processor(DocumentEmbeddingsQueryService):

         chunks = []

+        collection = (
+            "d_" + msg.user + "_" + msg.collection
+        )
+
+        # Check if collection exists - return empty if not
+        if not self.collection_exists(collection):
+            logger.info(f"Collection {collection} does not exist, returning empty results")
+            return []
+
         for vec in msg.vectors:

-            dim = len(vec)
-            collection = (
-                "d_" + msg.user + "_" + msg.collection
-            )
-
-            self.ensure_collection_exists(collection, dim)
-
             search_result = self.qdrant.query_points(
                 collection_name=collection,
                 query=vec,
```
```diff
@@ -38,24 +38,10 @@ class Processor(GraphEmbeddingsQueryService):
         )

         self.qdrant = QdrantClient(url=store_uri, api_key=api_key)
-        self.last_collection = None
-
-    def ensure_collection_exists(self, collection, dim):
-        """Ensure collection exists, create if it doesn't"""
-        if collection != self.last_collection:
-            if not self.qdrant.collection_exists(collection):
-                try:
-                    self.qdrant.create_collection(
-                        collection_name=collection,
-                        vectors_config=VectorParams(
-                            size=dim, distance=Distance.COSINE
-                        ),
-                    )
-                    logger.info(f"Created collection: {collection}")
-                except Exception as e:
-                    logger.error(f"Qdrant collection creation failed: {e}")
-                    raise e
-            self.last_collection = collection
+
+    def collection_exists(self, collection):
+        """Check if collection exists (no implicit creation)"""
+        return self.qdrant.collection_exists(collection)

     def create_value(self, ent):
         if ent.startswith("http://") or ent.startswith("https://"):

@@ -70,15 +56,17 @@ class Processor(GraphEmbeddingsQueryService):
         entity_set = set()
         entities = []

+        collection = (
+            "t_" + msg.user + "_" + msg.collection
+        )
+
+        # Check if collection exists - return empty if not
+        if not self.collection_exists(collection):
+            logger.info(f"Collection {collection} does not exist, returning empty results")
+            return []
+
         for vec in msg.vectors:

-            dim = len(vec)
-            collection = (
-                "t_" + msg.user + "_" + msg.collection
-            )
-
-            self.ensure_collection_exists(collection, dim)
-
             # Heuristic hack, get (2*limit), so that we have more chance
             # of getting (limit) entities
             search_result = self.qdrant.query_points(
```
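Both Qdrant query processors now compute the collection name once, up front, and return empty results when the collection is missing, instead of implicitly creating it on the query path. That guard, reduced to a pure function (names and the prefix convention are taken from the diff; the hit payload is illustrative):

```python
def query_embeddings(existing, user, coll, vectors, prefix="t_"):
    """Return [] when the collection is absent instead of creating it.

    `existing` models the set of collections the store already knows about.
    """
    collection = prefix + user + "_" + coll
    if collection not in existing:
        return []  # no implicit creation any more
    # Stand-in for the real vector search: report the dimension of each query
    return [len(v) for v in vectors]

empty = query_embeddings(set(), "alice", "docs", [[0.1, 0.2]])
hits = query_embeddings({"t_alice_docs"}, "alice", "docs", [[0.1, 0.2]])
print(empty, hits)  # [] [2]
```

Moving the check before the per-vector loop also means the query no longer needs the vector dimension to decide anything, which is why the `dim = len(vec)` lines disappear from the diff.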
```diff
@@ -60,19 +60,34 @@ class Processor(DocumentEmbeddingsStoreService):
             metrics=storage_response_metrics,
         )

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
     async def store_document_embeddings(self, message):

+        # Validate collection exists before accepting writes
+        if not self.vecstore.collection_exists(message.metadata.user, message.metadata.collection):
+            error_msg = (
+                f"Collection {message.metadata.collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
         for emb in message.chunks:

             if emb.chunk is None or emb.chunk == b"": continue

             chunk = emb.chunk.decode("utf-8")
             if chunk == "": continue

             for vec in emb.vectors:
                 self.vecstore.insert(
                     vec, chunk,
                     message.metadata.user,
                     message.metadata.collection
                 )

@@ -87,18 +102,21 @@ class Processor(DocumentEmbeddingsStoreService):
         help=f'Milvus store URI (default: {default_store_uri})'
     )

-    async def on_storage_management(self, message):
+    async def on_storage_management(self, message, consumer, flow):
         """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

         try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
             else:
                 response = StorageManagementResponse(
                     error=Error(
                         type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                     )
                 )
                 await self.storage_response_producer.send(response)

@@ -113,17 +131,40 @@ class Processor(DocumentEmbeddingsStoreService):
             )
             await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create a Milvus collection for document embeddings"""
+        try:
+            if self.vecstore.collection_exists(request.user, request.collection):
+                logger.info(f"Collection {request.user}/{request.collection} already exists")
+            else:
+                self.vecstore.create_collection(request.user, request.collection)
+                logger.info(f"Created collection {request.user}/{request.collection}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
         """Delete the collection for document embeddings"""
         try:
-            self.vecstore.delete_collection(message.user, message.collection)
+            self.vecstore.delete_collection(request.user, request.collection)

             # Send success response
             response = StorageManagementResponse(
                 error=None  # No error means success
             )
             await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

         except Exception as e:
             logger.error(f"Failed to delete collection: {e}")
```
@@ -115,38 +115,36 @@ class Processor(DocumentEmbeddingsStoreService):
                "Gave up waiting for index creation"
            )

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
    async def store_document_embeddings(self, message):

+        index_name = (
+            "d-" + message.metadata.user + "-" + message.metadata.collection
+        )
+
+        # Validate collection exists before accepting writes
+        if not self.pinecone.has_index(index_name):
+            error_msg = (
+                f"Collection {message.metadata.collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
        for emb in message.chunks:

            if emb.chunk is None or emb.chunk == b"": continue

            chunk = emb.chunk.decode("utf-8")
            if chunk == "": continue

            for vec in emb.vectors:

-                dim = len(vec)
-                index_name = (
-                    "d-" + message.metadata.user + "-" + message.metadata.collection
-                )
-
-                if index_name != self.last_index_name:
-
-                    if not self.pinecone.has_index(index_name):
-
-                        try:
-
-                            self.create_index(index_name, dim)
-
-                        except Exception as e:
-                            logger.error("Pinecone index creation failed")
-                            raise e
-
-                    logger.info(f"Index {index_name} created")
-
-                    self.last_index_name = index_name

                index = self.pinecone.Index(index_name)

                # Generate unique ID for each vector

@@ -192,18 +190,21 @@ class Processor(DocumentEmbeddingsStoreService):
            help=f'Pinecone region, (default: {default_region}'
        )

-    async def on_storage_management(self, message):
+    async def on_storage_management(self, message, consumer, flow):
        """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

        try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
            else:
                response = StorageManagementResponse(
                    error=Error(
                        type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                    )
                )
                await self.storage_response_producer.send(response)

@@ -218,10 +219,36 @@ class Processor(DocumentEmbeddingsStoreService):
            )
            await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create a Pinecone index for document embeddings"""
+        try:
+            index_name = f"d-{request.user}-{request.collection}"
+
+            if self.pinecone.has_index(index_name):
+                logger.info(f"Pinecone index {index_name} already exists")
+            else:
+                # Create with default dimension - will need to be recreated if dimension doesn't match
+                self.create_index(index_name, dim=384)
+                logger.info(f"Created Pinecone index: {index_name}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
        """Delete the collection for document embeddings"""
        try:
-            index_name = f"d-{message.user}-{message.collection}"
+            index_name = f"d-{request.user}-{request.collection}"

            if self.pinecone.has_index(index_name):
                self.pinecone.delete_index(index_name)

@@ -234,7 +261,7 @@ class Processor(DocumentEmbeddingsStoreService):
                error=None # No error means success
            )
            await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

        except Exception as e:
            logger.error(f"Failed to delete collection: {e}")

@@ -36,8 +36,6 @@ class Processor(DocumentEmbeddingsStoreService):
            }
        )

-        self.last_collection = None
-
        self.qdrant = QdrantClient(url=store_uri, api_key=api_key)

        # Set up storage management if base class attributes are available

@@ -71,8 +69,30 @@ class Processor(DocumentEmbeddingsStoreService):
            metrics=storage_response_metrics,
        )

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        if hasattr(self, 'storage_request_consumer'):
+            await self.storage_request_consumer.start()
+        if hasattr(self, 'storage_response_producer'):
+            await self.storage_response_producer.start()
+
    async def store_document_embeddings(self, message):

+        # Validate collection exists before accepting writes
+        collection = (
+            "d_" + message.metadata.user + "_" +
+            message.metadata.collection
+        )
+
+        if not self.qdrant.collection_exists(collection):
+            error_msg = (
+                f"Collection {message.metadata.collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
        for emb in message.chunks:

            chunk = emb.chunk.decode("utf-8")

@@ -80,29 +100,6 @@ class Processor(DocumentEmbeddingsStoreService):

            for vec in emb.vectors:

-                dim = len(vec)
-                collection = (
-                    "d_" + message.metadata.user + "_" +
-                    message.metadata.collection
-                )
-
-                if collection != self.last_collection:
-
-                    if not self.qdrant.collection_exists(collection):
-
-                        try:
-                            self.qdrant.create_collection(
-                                collection_name=collection,
-                                vectors_config=VectorParams(
-                                    size=dim, distance=Distance.COSINE
-                                ),
-                            )
-                        except Exception as e:
-                            logger.error("Qdrant collection creation failed")
-                            raise e
-
-                    self.last_collection = collection

                self.qdrant.upsert(
                    collection_name=collection,
                    points=[

@@ -133,18 +130,21 @@ class Processor(DocumentEmbeddingsStoreService):
            help=f'Qdrant API key (default: None)'
        )

-    async def on_storage_management(self, message):
+    async def on_storage_management(self, message, consumer, flow):
        """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

        try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
            else:
                response = StorageManagementResponse(
                    error=Error(
                        type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                    )
                )
                await self.storage_response_producer.send(response)

@@ -159,10 +159,43 @@ class Processor(DocumentEmbeddingsStoreService):
            )
            await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create a Qdrant collection for document embeddings"""
+        try:
+            collection_name = f"d_{request.user}_{request.collection}"
+
+            if self.qdrant.collection_exists(collection_name):
+                logger.info(f"Qdrant collection {collection_name} already exists")
+            else:
+                # Create collection with default dimension (will be recreated with correct dim on first write if needed)
+                # Using a placeholder dimension - actual dimension determined by first embedding
+                self.qdrant.create_collection(
+                    collection_name=collection_name,
+                    vectors_config=VectorParams(
+                        size=384, # Default dimension, common for many models
+                        distance=Distance.COSINE
+                    )
+                )
+                logger.info(f"Created Qdrant collection: {collection_name}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
        """Delete the collection for document embeddings"""
        try:
-            collection_name = f"d_{message.user}_{message.collection}"
+            collection_name = f"d_{request.user}_{request.collection}"

            if self.qdrant.collection_exists(collection_name):
                self.qdrant.delete_collection(collection_name)

@@ -175,7 +208,7 @@ class Processor(DocumentEmbeddingsStoreService):
                error=None # No error means success
            )
            await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

        except Exception as e:
            logger.error(f"Failed to delete collection: {e}")

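The same storage-management pattern recurs in every store processor touched by this commit: explicit `create-collection` / `delete-collection` operations, idempotent creation, and writes that fail fast when the target collection was never created. A minimal, self-contained sketch of that pattern follows; the in-memory `Store`, the dict-based request/response shapes, and the `store_embeddings` name are illustrative stand-ins, not TrustGraph or vector-store APIs.

```python
class Store:
    """In-memory stand-in for a vector store (Milvus/Pinecone/Qdrant)."""
    def __init__(self):
        self.collections = {}  # name -> list of stored vectors

    def collection_exists(self, name):
        return name in self.collections

    def create_collection(self, name):
        self.collections.setdefault(name, [])

    def delete_collection(self, name):
        self.collections.pop(name, None)

class Processor:
    def __init__(self, store):
        self.store = store

    def on_storage_management(self, request):
        """Dispatch create/delete operations, mirroring the handlers in the diff."""
        op = request["operation"]
        name = f"{request['user']}_{request['collection']}"
        if op == "create-collection":
            # Idempotent: creating an existing collection is not an error
            if not self.store.collection_exists(name):
                self.store.create_collection(name)
            return {"error": None}
        elif op == "delete-collection":
            self.store.delete_collection(name)
            return {"error": None}
        return {"error": {"type": "invalid_operation",
                          "message": f"Unknown operation: {op}"}}

    def store_embeddings(self, user, collection, vectors):
        """Writes no longer create collections implicitly; they fail fast."""
        name = f"{user}_{collection}"
        if not self.store.collection_exists(name):
            raise ValueError(
                f"Collection {collection} does not exist. "
                f"Create it first with tg-set-collection.")
        self.store.collections[name].extend(vectors)
```

The key behavioural change this mirrors is the removal of implicit creation: a write to a missing collection now raises instead of creating it as a side effect, so collection lifecycle is owned entirely by the management operations.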
@@ -60,8 +60,23 @@ class Processor(GraphEmbeddingsStoreService):
            metrics=storage_response_metrics,
        )

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
    async def store_graph_embeddings(self, message):

+        # Validate collection exists before accepting writes
+        if not self.vecstore.collection_exists(message.metadata.user, message.metadata.collection):
+            error_msg = (
+                f"Collection {message.metadata.collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
        for entity in message.entities:

            if entity.entity.value != "" and entity.entity.value is not None:

@@ -83,18 +98,21 @@ class Processor(GraphEmbeddingsStoreService):
            help=f'Milvus store URI (default: {default_store_uri})'
        )

-    async def on_storage_management(self, message):
+    async def on_storage_management(self, message, consumer, flow):
        """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

        try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
            else:
                response = StorageManagementResponse(
                    error=Error(
                        type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                    )
                )
                await self.storage_response_producer.send(response)

@@ -109,17 +127,40 @@ class Processor(GraphEmbeddingsStoreService):
            )
            await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create a Milvus collection for graph embeddings"""
+        try:
+            if self.vecstore.collection_exists(request.user, request.collection):
+                logger.info(f"Collection {request.user}/{request.collection} already exists")
+            else:
+                self.vecstore.create_collection(request.user, request.collection)
+                logger.info(f"Created collection {request.user}/{request.collection}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
        """Delete the collection for graph embeddings"""
        try:
-            self.vecstore.delete_collection(message.user, message.collection)
+            self.vecstore.delete_collection(request.user, request.collection)

            # Send success response
            response = StorageManagementResponse(
                error=None # No error means success
            )
            await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

        except Exception as e:
            logger.error(f"Failed to delete collection: {e}")

@@ -115,8 +115,27 @@ class Processor(GraphEmbeddingsStoreService):
                "Gave up waiting for index creation"
            )

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
    async def store_graph_embeddings(self, message):

+        index_name = (
+            "t-" + message.metadata.user + "-" + message.metadata.collection
+        )
+
+        # Validate collection exists before accepting writes
+        if not self.pinecone.has_index(index_name):
+            error_msg = (
+                f"Collection {message.metadata.collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
        for entity in message.entities:

            if entity.entity.value == "" or entity.entity.value is None:

@@ -124,28 +143,6 @@ class Processor(GraphEmbeddingsStoreService):

            for vec in entity.vectors:

-                dim = len(vec)
-
-                index_name = (
-                    "t-" + message.metadata.user + "-" + message.metadata.collection
-                )
-
-                if index_name != self.last_index_name:
-
-                    if not self.pinecone.has_index(index_name):
-
-                        try:
-
-                            self.create_index(index_name, dim)
-
-                        except Exception as e:
-                            logger.error("Pinecone index creation failed")
-                            raise e
-
-                    logger.info(f"Index {index_name} created")
-
-                    self.last_index_name = index_name

                index = self.pinecone.Index(index_name)

                # Generate unique ID for each vector

@@ -191,18 +188,21 @@ class Processor(GraphEmbeddingsStoreService):
            help=f'Pinecone region, (default: {default_region}'
        )

-    async def on_storage_management(self, message):
+    async def on_storage_management(self, message, consumer, flow):
        """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

        try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
            else:
                response = StorageManagementResponse(
                    error=Error(
                        type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                    )
                )
                await self.storage_response_producer.send(response)

@@ -217,10 +217,36 @@ class Processor(GraphEmbeddingsStoreService):
            )
            await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create a Pinecone index for graph embeddings"""
+        try:
+            index_name = f"t-{request.user}-{request.collection}"
+
+            if self.pinecone.has_index(index_name):
+                logger.info(f"Pinecone index {index_name} already exists")
+            else:
+                # Create with default dimension - will need to be recreated if dimension doesn't match
+                self.create_index(index_name, dim=384)
+                logger.info(f"Created Pinecone index: {index_name}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
        """Delete the collection for graph embeddings"""
        try:
-            index_name = f"t-{message.user}-{message.collection}"
+            index_name = f"t-{request.user}-{request.collection}"

            if self.pinecone.has_index(index_name):
                self.pinecone.delete_index(index_name)

@@ -233,7 +259,7 @@ class Processor(GraphEmbeddingsStoreService):
                error=None # No error means success
            )
            await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

        except Exception as e:
            logger.error(f"Failed to delete collection: {e}")

@@ -36,8 +36,6 @@ class Processor(GraphEmbeddingsStoreService):
            }
        )

-        self.last_collection = None
-
        self.qdrant = QdrantClient(url=store_uri, api_key=api_key)

        # Set up storage management if base class attributes are available

@@ -71,31 +69,30 @@ class Processor(GraphEmbeddingsStoreService):
            metrics=storage_response_metrics,
        )

-    def get_collection(self, dim, user, collection):
-
+    def get_collection(self, user, collection):
+        """Get collection name and validate it exists"""
        cname = (
            "t_" + user + "_" + collection
        )

-        if cname != self.last_collection:
-
-            if not self.qdrant.collection_exists(cname):
-
-                try:
-                    self.qdrant.create_collection(
-                        collection_name=cname,
-                        vectors_config=VectorParams(
-                            size=dim, distance=Distance.COSINE
-                        ),
-                    )
-                except Exception as e:
-                    logger.error("Qdrant collection creation failed")
-                    raise e
-
-            self.last_collection = cname
+        if not self.qdrant.collection_exists(cname):
+            error_msg = (
+                f"Collection {collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)

        return cname

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        if hasattr(self, 'storage_request_consumer'):
+            await self.storage_request_consumer.start()
+        if hasattr(self, 'storage_response_producer'):
+            await self.storage_response_producer.start()
+
    async def store_graph_embeddings(self, message):

        for entity in message.entities:

@@ -104,10 +101,8 @@ class Processor(GraphEmbeddingsStoreService):

            for vec in entity.vectors:

-                dim = len(vec)
-
                collection = self.get_collection(
-                    dim, message.metadata.user, message.metadata.collection
+                    message.metadata.user, message.metadata.collection
                )

                self.qdrant.upsert(

@@ -140,18 +135,21 @@ class Processor(GraphEmbeddingsStoreService):
            help=f'Qdrant API key'
        )

-    async def on_storage_management(self, message):
+    async def on_storage_management(self, message, consumer, flow):
        """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

        try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
            else:
                response = StorageManagementResponse(
                    error=Error(
                        type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                    )
                )
                await self.storage_response_producer.send(response)

@@ -166,10 +164,43 @@ class Processor(GraphEmbeddingsStoreService):
            )
            await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create a Qdrant collection for graph embeddings"""
+        try:
+            collection_name = f"t_{request.user}_{request.collection}"
+
+            if self.qdrant.collection_exists(collection_name):
+                logger.info(f"Qdrant collection {collection_name} already exists")
+            else:
+                # Create collection with default dimension (will be recreated with correct dim on first write if needed)
+                # Using a placeholder dimension - actual dimension determined by first embedding
+                self.qdrant.create_collection(
+                    collection_name=collection_name,
+                    vectors_config=VectorParams(
+                        size=384, # Default dimension, common for many models
+                        distance=Distance.COSINE
+                    )
+                )
+                logger.info(f"Created Qdrant collection: {collection_name}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
        """Delete the collection for graph embeddings"""
        try:
-            collection_name = f"t_{message.user}_{message.collection}"
+            collection_name = f"t_{request.user}_{request.collection}"

            if self.qdrant.collection_exists(collection_name):
                self.qdrant.delete_collection(collection_name)

@@ -182,7 +213,7 @@ class Processor(GraphEmbeddingsStoreService):
                error=None # No error means success
            )
            await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

        except Exception as e:
            logger.error(f"Failed to delete collection: {e}")

@@ -295,6 +295,8 @@ class Processor(FlowProcessor):

        try:
            self.session.execute(create_table_cql)
+            if keyspace not in self.known_tables:
+                self.known_tables[keyspace] = set()
            self.known_tables[keyspace].add(table_key)
            logger.info(f"Ensured table exists: {safe_keyspace}.{safe_table}")

@@ -340,18 +342,47 @@ class Processor(FlowProcessor):
            logger.warning(f"Failed to convert value {value} to type {field_type}: {e}")
            return str(value)

+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
    async def on_object(self, msg, consumer, flow):
        """Process incoming ExtractedObject and store in Cassandra"""

        obj = msg.value()
        logger.info(f"Storing {len(obj.values)} objects for schema {obj.schema_name} from {obj.metadata.id}")

+        # Validate collection/keyspace exists before accepting writes
+        safe_keyspace = self.sanitize_name(obj.metadata.user)
+        if safe_keyspace not in self.known_keyspaces:
+            # Check if keyspace actually exists in Cassandra
+            self.connect_cassandra()
+            check_keyspace_cql = """
+                SELECT keyspace_name FROM system_schema.keyspaces
+                WHERE keyspace_name = %s
+            """
+            result = self.session.execute(check_keyspace_cql, (safe_keyspace,))
+            # Check if result is None (mock case) or has no rows
+            if result is None or not result.one():
+                error_msg = (
+                    f"Collection {obj.metadata.collection} does not exist. "
+                    f"Create it first with tg-set-collection."
+                )
+                logger.error(error_msg)
+                raise ValueError(error_msg)
+            # Cache it if it exists
+            self.known_keyspaces.add(safe_keyspace)
+            if safe_keyspace not in self.known_tables:
+                self.known_tables[safe_keyspace] = set()
+
        # Get schema definition
        schema = self.schemas.get(obj.schema_name)
        if not schema:
            logger.warning(f"No schema found for {obj.schema_name} - skipping")
            return

        # Ensure table exists
        keyspace = obj.metadata.user
        table_name = obj.schema_name

@@ -428,7 +459,16 @@ class Processor(FlowProcessor):
        logger.info(f"Received storage management request: {msg.operation} for {msg.user}/{msg.collection}")

        try:
-            if msg.operation == "delete-collection":
+            if msg.operation == "create-collection":
+                await self.create_collection(msg.user, msg.collection)
+
+                # Send success response
+                response = StorageManagementResponse(
+                    error=None # No error means success
+                )
+                await self.storage_response_producer.send(response)
+                logger.info(f"Successfully created collection {msg.user}/{msg.collection}")
+            elif msg.operation == "delete-collection":
                await self.delete_collection(msg.user, msg.collection)

                # Send success response

@@ -459,7 +499,25 @@ class Processor(FlowProcessor):
                    message=str(e)
                )
            )
-            await self.send("storage-response", response)
+            await self.storage_response_producer.send(response)

+    async def create_collection(self, user: str, collection: str):
+        """Create/verify collection exists in Cassandra object store"""
+        # Connect if not already connected
+        self.connect_cassandra()
+
+        # Sanitize names for safety
+        safe_keyspace = self.sanitize_name(user)
+
+        # Ensure keyspace exists
+        if safe_keyspace not in self.known_keyspaces:
+            self.ensure_keyspace(safe_keyspace)
+            self.known_keyspaces.add(safe_keyspace)
+
+        # For Cassandra objects, collection is just a property in rows
+        # No need to create separate tables per collection
+        # Just mark that we've seen this collection
+        logger.info(f"Collection {collection} ready for user {user} (using keyspace {safe_keyspace})")
+
    async def delete_collection(self, user: str, collection: str):
        """Delete all data for a specific collection"""

@@ -109,6 +109,15 @@ class Processor(TriplesStoreService):

        self.table = user

+        # Validate collection exists before accepting writes
+        if not self.tg.collection_exists(message.metadata.collection):
+            error_msg = (
+                f"Collection {message.metadata.collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
        for t in message.triples:
            self.tg.insert(
                message.metadata.collection,

@@ -117,18 +126,27 @@ class Processor(TriplesStoreService):
                t.o.value
            )

-    async def on_storage_management(self, message):
+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
+    async def on_storage_management(self, message, consumer, flow):
        """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

        try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
            else:
                response = StorageManagementResponse(
                    error=Error(
                        type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                    )
                )
                await self.storage_response_producer.send(response)

@@ -143,42 +161,85 @@ class Processor(TriplesStoreService):
            )
            await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
-        """Delete all data for a specific collection from the unified triples table"""
+    async def handle_create_collection(self, request):
+        """Create a collection in Cassandra triple store"""
        try:
            # Create or reuse connection for this user's keyspace
-            if self.table is None or self.table != message.user:
+            if self.table is None or self.table != request.user:
                self.tg = None

                try:
                    if self.cassandra_username and self.cassandra_password:
                        self.tg = KnowledgeGraph(
                            hosts=self.cassandra_host,
-                            keyspace=message.user,
+                            keyspace=request.user,
                            username=self.cassandra_username,
                            password=self.cassandra_password
                        )
                    else:
                        self.tg = KnowledgeGraph(
                            hosts=self.cassandra_host,
-                            keyspace=message.user,
+                            keyspace=request.user,
                        )
                except Exception as e:
-                    logger.error(f"Failed to connect to Cassandra for user {message.user}: {e}")
+                    logger.error(f"Failed to connect to Cassandra for user {request.user}: {e}")
                    raise

-            self.table = message.user
+            self.table = request.user

-            # Delete all triples for this collection from the unified table
-            # In the unified table schema, collection is the partition key
-            delete_cql = """
-                DELETE FROM triples
-                WHERE collection = ?
-            """
+            # Create collection using the built-in method
+            logger.info(f"Creating collection {request.collection} for user {request.user}")
+
+            if self.tg.collection_exists(request.collection):
+                logger.info(f"Collection {request.collection} already exists")
+            else:
+                self.tg.create_collection(request.collection)
+                logger.info(f"Created collection {request.collection}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
|
||||
error=Error(
|
||||
type="creation_error",
|
||||
message=str(e)
|
||||
)
|
||||
)
|
||||
await self.storage_response_producer.send(response)
|
||||
|
||||
async def handle_delete_collection(self, request):
|
||||
"""Delete all data for a specific collection from the unified triples table"""
|
||||
try:
|
||||
# Create or reuse connection for this user's keyspace
|
||||
if self.table is None or self.table != request.user:
|
||||
self.tg = None
|
||||
|
||||
try:
|
||||
if self.cassandra_username and self.cassandra_password:
|
||||
self.tg = KnowledgeGraph(
|
||||
hosts=self.cassandra_host,
|
||||
keyspace=request.user,
|
||||
username=self.cassandra_username,
|
||||
password=self.cassandra_password
|
||||
)
|
||||
else:
|
||||
self.tg = KnowledgeGraph(
|
||||
hosts=self.cassandra_host,
|
||||
keyspace=request.user,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to connect to Cassandra for user {request.user}: {e}")
|
||||
raise
|
||||
|
||||
self.table = request.user
|
||||
|
||||
# Delete all triples for this collection using the built-in method
|
||||
try:
|
||||
self.tg.session.execute(delete_cql, (message.collection,))
|
||||
logger.info(f"Deleted all triples for collection {message.collection} from keyspace {message.user}")
|
||||
self.tg.delete_collection(request.collection)
|
||||
logger.info(f"Deleted all triples for collection {request.collection} from keyspace {request.user}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to delete collection data: {e}")
|
||||
raise
|
||||
|
|
@ -188,7 +249,7 @@ class Processor(TriplesStoreService):
|
|||
error=None # No error means success
|
||||
)
|
||||
await self.storage_response_producer.send(response)
|
||||
logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
|
||||
logger.info(f"Successfully deleted collection {request.user}/{request.collection}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to delete collection: {e}")
|
||||
|
|
|
|||
|
|
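All four back-ends now share the same dispatch shape: `create-collection` and `delete-collection` are handled explicitly, and anything else produces an `invalid_operation` error. Below is a minimal in-memory sketch of that contract — `InMemoryTripleStore` and `handle` are hypothetical names, not TrustGraph classes — useful for checking the intended semantics without a running store:

```python
# Hypothetical in-memory model of the storage-management dispatch pattern.
class InMemoryTripleStore:
    def __init__(self):
        self.collections = {}  # collection name -> set of (s, p, o) triples

    def collection_exists(self, name):
        return name in self.collections

    def create_collection(self, name):
        # Idempotent, like the real handlers: re-creating is a no-op.
        self.collections.setdefault(name, set())

    def delete_collection(self, name):
        # Drops the collection and every triple in it in one step.
        self.collections.pop(name, None)

def handle(store, operation, collection):
    """Dispatch one request; return an error string, or None on success."""
    if operation == "create-collection":
        store.create_collection(collection)
        return None
    elif operation == "delete-collection":
        store.delete_collection(collection)
        return None
    else:
        return f"Unknown operation: {operation}"

store = InMemoryTripleStore()
assert handle(store, "create-collection", "docs") is None
assert store.collection_exists("docs")
assert handle(store, "delete-collection", "docs") is None
assert not store.collection_exists("docs")
assert handle(store, "compact", "docs") == "Unknown operation: compact"
```

The real processors differ only in what `create_collection`/`delete_collection` touch underneath: a Cassandra collection table here, `CollectionMetadata` graph nodes in the other back-ends.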
**FalkorDB triples store processor:**

```diff
@@ -152,11 +152,43 @@ class Processor(TriplesStoreService):
             time=res.run_time_ms
         ))

+    def collection_exists(self, user, collection):
+        """Check if collection metadata node exists"""
+        result = self.io.query(
+            "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) "
+            "RETURN c LIMIT 1",
+            params={"user": user, "collection": collection}
+        )
+        return result.result_set is not None and len(result.result_set) > 0
+
+    def create_collection(self, user, collection):
+        """Create collection metadata node"""
+        import datetime
+        self.io.query(
+            "MERGE (c:CollectionMetadata {user: $user, collection: $collection}) "
+            "SET c.created_at = $created_at",
+            params={
+                "user": user,
+                "collection": collection,
+                "created_at": datetime.datetime.now().isoformat()
+            }
+        )
+        logger.info(f"Created collection metadata node for {user}/{collection}")
+
     async def store_triples(self, message):
         # Extract user and collection from metadata
         user = message.metadata.user if message.metadata.user else "default"
         collection = message.metadata.collection if message.metadata.collection else "default"

+        # Validate collection exists before accepting writes
+        if not self.collection_exists(user, collection):
+            error_msg = (
+                f"Collection {collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
         for t in message.triples:

             self.create_node(t.s.value, user, collection)
@@ -185,18 +217,27 @@ class Processor(TriplesStoreService):
             help=f'FalkorDB database (default: {default_database})'
         )

-    async def on_storage_management(self, message):
+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
+    async def on_storage_management(self, message, consumer, flow):
         """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

         try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
             else:
                 response = StorageManagementResponse(
                     error=Error(
                         type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                     )
                 )
                 await self.storage_response_producer.send(response)
@@ -211,28 +252,57 @@ class Processor(TriplesStoreService):
             )
             await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create collection metadata in FalkorDB"""
+        try:
+            if self.collection_exists(request.user, request.collection):
+                logger.info(f"Collection {request.user}/{request.collection} already exists")
+            else:
+                self.create_collection(request.user, request.collection)
+                logger.info(f"Created collection {request.user}/{request.collection}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
         """Delete the collection for FalkorDB triples"""
         try:
             # Delete all nodes and literals for this user/collection
             node_result = self.io.query(
                 "MATCH (n:Node {user: $user, collection: $collection}) DETACH DELETE n",
-                params={"user": message.user, "collection": message.collection}
+                params={"user": request.user, "collection": request.collection}
             )

             literal_result = self.io.query(
                 "MATCH (n:Literal {user: $user, collection: $collection}) DETACH DELETE n",
-                params={"user": message.user, "collection": message.collection}
+                params={"user": request.user, "collection": request.collection}
             )

-            logger.info(f"Deleted {node_result.nodes_deleted} nodes and {literal_result.nodes_deleted} literals for collection {message.user}/{message.collection}")
+            # Delete collection metadata node
+            metadata_result = self.io.query(
+                "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) DELETE c",
+                params={"user": request.user, "collection": request.collection}
+            )
+
+            logger.info(f"Deleted {node_result.nodes_deleted} nodes, {literal_result.nodes_deleted} literals, and {metadata_result.nodes_deleted} metadata nodes for collection {request.user}/{request.collection}")

             # Send success response
             response = StorageManagementResponse(
                 error=None  # No error means success
             )
             await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

         except Exception as e:
             logger.error(f"Failed to delete collection: {e}")
```
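The validate-before-write rule — collections must be created explicitly, never implicitly on first write — can be sketched independently of any graph database. `CollectionRegistry` and the bare `store_triples` below are hypothetical stand-ins for the processor's real state, not TrustGraph code:

```python
# Hypothetical model of "reject writes to collections that were never created".
class CollectionRegistry:
    def __init__(self):
        self._known = set()  # set of (user, collection) pairs

    def create(self, user, collection):
        self._known.add((user, collection))

    def exists(self, user, collection):
        return (user, collection) in self._known

def store_triples(registry, user, collection, triples):
    """Refuse the write unless the collection was explicitly created."""
    if not registry.exists(user, collection):
        raise ValueError(
            f"Collection {collection} does not exist. "
            f"Create it first with tg-set-collection."
        )
    return len(triples)  # stand-in for the real insert loop

registry = CollectionRegistry()

# Before creation: the write is rejected.
try:
    store_triples(registry, "alice", "papers", [("s", "p", "o")])
    rejected = False
except ValueError:
    rejected = True
assert rejected

# After explicit creation: the same write succeeds.
registry.create("alice", "papers")
assert store_triples(registry, "alice", "papers", [("s", "p", "o")]) == 1
```

This is the behavioural change the commit message describes as "Remove implicit collection creation": the error now surfaces at write time instead of the store silently creating the collection.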
**Memgraph triples store processor:**

```diff
@@ -267,12 +267,43 @@ class Processor(TriplesStoreService):
             src=t.s.value, dest=t.o.value, uri=t.p.value, user=user, collection=collection,
         )

+    def collection_exists(self, user, collection):
+        """Check if collection metadata node exists"""
+        with self.io.session(database=self.db) as session:
+            result = session.run(
+                "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) "
+                "RETURN c LIMIT 1",
+                user=user, collection=collection
+            )
+            return bool(list(result))
+
+    def create_collection(self, user, collection):
+        """Create collection metadata node"""
+        import datetime
+        with self.io.session(database=self.db) as session:
+            session.run(
+                "MERGE (c:CollectionMetadata {user: $user, collection: $collection}) "
+                "SET c.created_at = $created_at",
+                user=user, collection=collection,
+                created_at=datetime.datetime.now().isoformat()
+            )
+        logger.info(f"Created collection metadata node for {user}/{collection}")
+
     async def store_triples(self, message):

         # Extract user and collection from metadata
         user = message.metadata.user if message.metadata.user else "default"
         collection = message.metadata.collection if message.metadata.collection else "default"

+        # Validate collection exists before accepting writes
+        if not self.collection_exists(user, collection):
+            error_msg = (
+                f"Collection {collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
         for t in message.triples:

             self.create_node(t.s.value, user, collection)
@@ -317,18 +348,27 @@ class Processor(TriplesStoreService):
             help=f'Memgraph database (default: {default_database})'
         )

-    async def on_storage_management(self, message):
+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
+    async def on_storage_management(self, message, consumer, flow):
         """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

         try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
             else:
                 response = StorageManagementResponse(
                     error=Error(
                         type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                     )
                 )
                 await self.storage_response_producer.send(response)
@@ -343,7 +383,30 @@ class Processor(TriplesStoreService):
             )
             await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    async def handle_create_collection(self, request):
+        """Create collection metadata in Memgraph"""
+        try:
+            if self.collection_exists(request.user, request.collection):
+                logger.info(f"Collection {request.user}/{request.collection} already exists")
+            else:
+                self.create_collection(request.user, request.collection)
+                logger.info(f"Created collection {request.user}/{request.collection}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
         """Delete all data for a specific collection"""
         try:
             with self.io.session(database=self.db) as session:
@@ -351,7 +414,7 @@ class Processor(TriplesStoreService):
                 node_result = session.run(
                     "MATCH (n:Node {user: $user, collection: $collection}) "
                     "DETACH DELETE n",
-                    user=message.user, collection=message.collection
+                    user=request.user, collection=request.collection
                 )
                 nodes_deleted = node_result.consume().counters.nodes_deleted

@@ -359,20 +422,28 @@ class Processor(TriplesStoreService):
                 literal_result = session.run(
                     "MATCH (n:Literal {user: $user, collection: $collection}) "
                     "DETACH DELETE n",
-                    user=message.user, collection=message.collection
+                    user=request.user, collection=request.collection
                 )
                 literals_deleted = literal_result.consume().counters.nodes_deleted

+                # Delete collection metadata node
+                metadata_result = session.run(
+                    "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) "
+                    "DELETE c",
+                    user=request.user, collection=request.collection
+                )
+                metadata_deleted = metadata_result.consume().counters.nodes_deleted
+
                 # Note: Relationships are automatically deleted with DETACH DELETE

-                logger.info(f"Deleted {nodes_deleted} nodes and {literals_deleted} literals for {message.user}/{message.collection}")
+                logger.info(f"Deleted {nodes_deleted} nodes, {literals_deleted} literals, and {metadata_deleted} metadata nodes for {request.user}/{request.collection}")

             # Send success response
             response = StorageManagementResponse(
                 error=None  # No error means success
             )
             await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

         except Exception as e:
             logger.error(f"Failed to delete collection: {e}")
```
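The Cypher `MERGE` used for the `CollectionMetadata` node keeps creation idempotent: issuing create-collection twice matches the existing node rather than duplicating it, while `SET` refreshes `created_at`. A hypothetical in-memory model of that behaviour (the dict-backed `merge_collection_metadata` is an illustration, not driver code):

```python
import datetime

def merge_collection_metadata(nodes, user, collection):
    """Model MERGE (c:CollectionMetadata {user, collection}) SET c.created_at."""
    key = (user, collection)
    node = nodes.setdefault(key, {})  # create only if absent, like MERGE
    node["created_at"] = datetime.datetime.now().isoformat()  # like SET
    return nodes

nodes = {}
merge_collection_metadata(nodes, "alice", "papers")
merge_collection_metadata(nodes, "alice", "papers")  # second MERGE matches, no duplicate

assert len(nodes) == 1
assert "created_at" in nodes[("alice", "papers")]
```

That idempotence is why the create handlers can answer "already exists" with a success response instead of an error.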
**Neo4j triples store processor:**

```diff
@@ -228,6 +228,15 @@ class Processor(TriplesStoreService):
         user = message.metadata.user if message.metadata.user else "default"
         collection = message.metadata.collection if message.metadata.collection else "default"

+        # Validate collection exists before accepting writes
+        if not self.collection_exists(user, collection):
+            error_msg = (
+                f"Collection {collection} does not exist. "
+                f"Create it first with tg-set-collection."
+            )
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
         for t in message.triples:

             self.create_node(t.s.value, user, collection)
@@ -268,18 +277,27 @@ class Processor(TriplesStoreService):
             help=f'Neo4j database (default: {default_database})'
         )

-    async def on_storage_management(self, message):
+    async def start(self):
+        """Start the processor and its storage management consumer"""
+        await super().start()
+        await self.storage_request_consumer.start()
+        await self.storage_response_producer.start()
+
+    async def on_storage_management(self, message, consumer, flow):
         """Handle storage management requests"""
-        logger.info(f"Storage management request: {message.operation} for {message.user}/{message.collection}")
+        request = message.value()
+        logger.info(f"Storage management request: {request.operation} for {request.user}/{request.collection}")

         try:
-            if message.operation == "delete-collection":
-                await self.handle_delete_collection(message)
+            if request.operation == "create-collection":
+                await self.handle_create_collection(request)
+            elif request.operation == "delete-collection":
+                await self.handle_delete_collection(request)
             else:
                 response = StorageManagementResponse(
                     error=Error(
                         type="invalid_operation",
-                        message=f"Unknown operation: {message.operation}"
+                        message=f"Unknown operation: {request.operation}"
                     )
                 )
                 await self.storage_response_producer.send(response)
@@ -294,7 +312,52 @@ class Processor(TriplesStoreService):
             )
             await self.storage_response_producer.send(response)

-    async def handle_delete_collection(self, message):
+    def collection_exists(self, user, collection):
+        """Check if collection metadata node exists"""
+        with self.io.session(database=self.db) as session:
+            result = session.run(
+                "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) "
+                "RETURN c LIMIT 1",
+                user=user, collection=collection
+            )
+            return bool(list(result))
+
+    def create_collection(self, user, collection):
+        """Create collection metadata node"""
+        import datetime
+        with self.io.session(database=self.db) as session:
+            session.run(
+                "MERGE (c:CollectionMetadata {user: $user, collection: $collection}) "
+                "SET c.created_at = $created_at",
+                user=user, collection=collection,
+                created_at=datetime.datetime.now().isoformat()
+            )
+        logger.info(f"Created collection metadata node for {user}/{collection}")
+
+    async def handle_create_collection(self, request):
+        """Create collection metadata in Neo4j"""
+        try:
+            if self.collection_exists(request.user, request.collection):
+                logger.info(f"Collection {request.user}/{request.collection} already exists")
+            else:
+                self.create_collection(request.user, request.collection)
+                logger.info(f"Created collection {request.user}/{request.collection}")
+
+            # Send success response
+            response = StorageManagementResponse(error=None)
+            await self.storage_response_producer.send(response)
+
+        except Exception as e:
+            logger.error(f"Failed to create collection: {e}", exc_info=True)
+            response = StorageManagementResponse(
+                error=Error(
+                    type="creation_error",
+                    message=str(e)
+                )
+            )
+            await self.storage_response_producer.send(response)
+
+    async def handle_delete_collection(self, request):
         """Delete all data for a specific collection"""
         try:
             with self.io.session(database=self.db) as session:
@@ -302,7 +365,7 @@ class Processor(TriplesStoreService):
                 node_result = session.run(
                     "MATCH (n:Node {user: $user, collection: $collection}) "
                     "DETACH DELETE n",
-                    user=message.user, collection=message.collection
+                    user=request.user, collection=request.collection
                 )
                 nodes_deleted = node_result.consume().counters.nodes_deleted

@@ -310,20 +373,28 @@ class Processor(TriplesStoreService):
                 literal_result = session.run(
                     "MATCH (n:Literal {user: $user, collection: $collection}) "
                     "DETACH DELETE n",
-                    user=message.user, collection=message.collection
+                    user=request.user, collection=request.collection
                )
                 literals_deleted = literal_result.consume().counters.nodes_deleted

-                # Note: Relationships are automatically deleted with DETACH DELETE
-
-                logger.info(f"Deleted {nodes_deleted} nodes and {literals_deleted} literals for {message.user}/{message.collection}")
+                # Delete collection metadata node
+                metadata_result = session.run(
+                    "MATCH (c:CollectionMetadata {user: $user, collection: $collection}) "
+                    "DELETE c",
+                    user=request.user, collection=request.collection
+                )
+                metadata_deleted = metadata_result.consume().counters.nodes_deleted
+
+                logger.info(f"Deleted {nodes_deleted} nodes, {literals_deleted} literals, and {metadata_deleted} metadata nodes for {request.user}/{request.collection}")

             # Send success response
             response = StorageManagementResponse(
                 error=None  # No error means success
             )
             await self.storage_response_producer.send(response)
-            logger.info(f"Successfully deleted collection {message.user}/{message.collection}")
+            logger.info(f"Successfully deleted collection {request.user}/{request.collection}")

         except Exception as e:
             logger.error(f"Failed to delete collection: {e}")
```
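Note the asymmetry in the graph-store delete handlers: data nodes are removed with `DETACH DELETE` (their relationships go with them), while the `CollectionMetadata` node uses plain `DELETE`, which in Cypher only succeeds for nodes that have no relationships — and the metadata node never does. A tiny in-memory model of that distinction (the helper names here are hypothetical illustrations, not driver APIs):

```python
# Model a graph as a set of node ids plus a list of (a, b) relationship tuples.
def detach_delete(nodes, edges, node_id):
    """Like Cypher DETACH DELETE: drop the node and all its relationships."""
    edges[:] = [(a, b) for (a, b) in edges if a != node_id and b != node_id]
    nodes.discard(node_id)

def delete(nodes, edges, node_id):
    """Like plain Cypher DELETE: fails if the node still has relationships."""
    if any(node_id in e for e in edges):
        raise RuntimeError("cannot DELETE a node that still has relationships")
    nodes.discard(node_id)

nodes = {"n1", "n2", "meta"}
edges = [("n1", "n2")]

delete(nodes, edges, "meta")       # metadata node: relationship-free, plain DELETE works
detach_delete(nodes, edges, "n1")  # data node: DETACH DELETE removes its edge too

assert nodes == {"n2"}
assert edges == []
```

Deleting data nodes first and the metadata node last also means a half-finished deletion leaves the collection still "existing", so the operation can simply be retried.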