# Structured Data Technical Specification (Part 2) ## Overview This specification addresses issues and gaps identified during the initial implementation of TrustGraph's structured data integration, as described in `structured-data.md`. ## Problem Statements ### 1. Naming Inconsistency: "Object" vs "Row" The current implementation uses "object" terminology throughout (e.g., `ExtractedObject`, object extraction, object embeddings). This naming is too generic and causes confusion: - "Object" is an overloaded term in software (Python objects, JSON objects, etc.) - The data being handled is fundamentally tabular - rows in tables with defined schemas - "Row" more accurately describes the data model and aligns with database terminology This inconsistency appears in module names, class names, message types, and documentation. ### 2. Row Store Query Limitations The current row store implementation has significant query limitations: **Natural Language Mismatch**: Queries struggle with real-world data variations. For example: - A street database containing `"CHESTNUT ST"` is difficult to find when asking about `"Chestnut Street"` - Abbreviations, case differences, and formatting variations break exact-match queries - Users expect semantic understanding, but the store provides literal matching **Schema Evolution Issues**: Changing schemas causes problems: - Existing data may not conform to updated schemas - Table structure changes can break queries and data integrity - No clear migration path for schema updates ### 3. Row Embeddings Required Related to problem 2, the system needs vector embeddings for row data to enable: - Semantic search across structured data (finding "Chestnut Street" when data contains "CHESTNUT ST") - Similarity matching for fuzzy queries - Hybrid search combining structured filters with semantic similarity - Better natural language query support The embedding service was specified but not implemented. ### 4. Row Data Ingestion Incomplete The structured data ingestion pipeline is not fully operational: - Diagnostic prompts exist to classify input formats (CSV, JSON, etc.) - The ingestion service that uses these prompts is not plumbed into the system - No end-to-end path for loading pre-structured data into the row store ## Goals - **Schema Flexibility**: Enable schema evolution without breaking existing data or requiring migrations - **Consistent Naming**: Standardize on "row" terminology throughout the codebase - **Semantic Queryability**: Support fuzzy/semantic matching via row embeddings - **Complete Ingestion Pipeline**: Provide end-to-end path for loading structured data ## Technical Design ### Unified Row Storage Schema The previous implementation created a separate Cassandra table for each schema. This caused problems when schemas evolved, as table structure changes required migrations. The new design uses a single unified table for all row data: ```sql CREATE TABLE rows ( collection text, schema_name text, index_name text, index_value frozen>, data map, source text, PRIMARY KEY ((collection, schema_name, index_name), index_value) ) ``` #### Column Definitions | Column | Type | Description | |--------|------|-------------| | `collection` | `text` | Data collection/import identifier (from metadata) | | `schema_name` | `text` | Name of the schema this row conforms to | | `index_name` | `text` | Name of the indexed field(s), comma-joined for composites | | `index_value` | `frozen>` | Index value(s) as a list | | `data` | `map` | Row data as key-value pairs | | `source` | `text` | Optional URI linking to provenance information in the knowledge graph. Empty string or NULL indicates no source. | #### Index Handling Each row is stored multiple times - once per indexed field defined in the schema. The primary key fields are treated as an index with no special marker, providing future flexibility. **Single-field index example:** - Schema defines `email` as indexed - `index_name = "email"` - `index_value = ['foo@bar.com']` **Composite index example:** - Schema defines composite index on `region` and `status` - `index_name = "region,status"` (field names sorted and comma-joined) - `index_value = ['US', 'active']` (values in same order as field names) **Primary key example:** - Schema defines `customer_id` as primary key - `index_name = "customer_id"` - `index_value = ['CUST001']` #### Query Patterns All queries follow the same pattern regardless of which index is used: ```sql SELECT * FROM rows WHERE collection = 'import_2024' AND schema_name = 'customers' AND index_name = 'email' AND index_value = ['foo@bar.com'] ``` #### Design Trade-offs **Advantages:** - Schema changes don't require table structure changes - Row data is opaque to Cassandra - field additions/removals are transparent - Consistent query pattern for all access methods - No Cassandra secondary indexes (which can be slow at scale) - Native Cassandra types throughout (`map`, `frozen`) **Trade-offs:** - Write amplification: each row insert = N inserts (one per indexed field) - Storage overhead from duplicated row data - Type information stored in schema config, conversion at application layer #### Consistency Model The design accepts certain simplifications: 1. **No row updates**: The system is append-only. This eliminates consistency concerns about updating multiple copies of the same row. 2. **Schema change tolerance**: When schemas change (e.g., indexes added/removed), existing rows retain their original indexing. Old rows won't be discoverable via new indexes. Users can delete and recreate a schema to ensure consistency if needed. ### Partition Tracking and Deletion #### The Problem With the partition key `(collection, schema_name, index_name)`, efficient deletion requires knowing all partition keys to delete. Deleting by just `collection` or `collection + schema_name` requires knowing all the `index_name` values that have data. #### Partition Tracking Table A secondary lookup table tracks which partitions exist: ```sql CREATE TABLE row_partitions ( collection text, schema_name text, index_name text, PRIMARY KEY ((collection), schema_name, index_name) ) ``` This enables efficient discovery of partitions for deletion operations. #### Row Writer Behavior The row writer maintains an in-memory cache of registered `(collection, schema_name)` pairs. When processing a row: 1. Check if `(collection, schema_name)` is in the cache 2. If not cached (first row for this pair): - Look up the schema config to get all index names - Insert entries into `row_partitions` for each `(collection, schema_name, index_name)` - Add the pair to the cache 3. Proceed with writing the row data The row writer also monitors schema config change events. When a schema changes, relevant cache entries are cleared so the next row triggers re-registration with the updated index names. This approach ensures: - Lookup table writes happen once per `(collection, schema_name)` pair, not per row - The lookup table reflects the indexes that were active when data was written - Schema changes mid-import are picked up correctly #### Deletion Operations **Delete collection:** ```sql -- 1. Discover all partitions SELECT schema_name, index_name FROM row_partitions WHERE collection = 'X'; -- 2. Delete each partition from rows table DELETE FROM rows WHERE collection = 'X' AND schema_name = '...' AND index_name = '...'; -- (repeat for each discovered partition) -- 3. Clean up the lookup table DELETE FROM row_partitions WHERE collection = 'X'; ``` **Delete collection + schema:** ```sql -- 1. Discover partitions for this schema SELECT index_name FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y'; -- 2. Delete each partition from rows table DELETE FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '...'; -- (repeat for each discovered partition) -- 3. Clean up the lookup table entries DELETE FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y'; ``` ### Row Embeddings Row embeddings enable semantic/fuzzy matching on indexed values, solving the natural language mismatch problem (e.g., finding "CHESTNUT ST" when querying for "Chestnut Street"). #### Design Overview Each indexed value is embedded and stored in a vector store (Qdrant). At query time, the query is embedded, similar vectors are found, and the associated metadata is used to look up the actual rows in Cassandra. #### Qdrant Collection Structure One Qdrant collection per `(user, collection, schema_name, dimension)` tuple: - **Collection naming:** `rows_{user}_{collection}_{schema_name}_{dimension}` - Names are sanitized (non-alphanumeric characters replaced with `_`, lowercased, numeric prefixes get `r_` prefix) - **Rationale:** Enables clean deletion of a `(user, collection, schema_name)` instance by dropping matching Qdrant collections; dimension suffix allows different embedding models to coexist #### What Gets Embedded The text representation of index values: | Index Type | Example `index_value` | Text to Embed | |------------|----------------------|---------------| | Single-field | `['foo@bar.com']` | `"foo@bar.com"` | | Composite | `['US', 'active']` | `"US active"` (space-joined) | #### Point Structure Each Qdrant point contains: ```json { "id": "", "vector": [0.1, 0.2, ...], "payload": { "index_name": "street_name", "index_value": ["CHESTNUT ST"], "text": "CHESTNUT ST" } } ``` | Payload Field | Description | |---------------|-------------| | `index_name` | The indexed field(s) this embedding represents | | `index_value` | The original list of values (for Cassandra lookup) | | `text` | The text that was embedded (for debugging/display) | Note: `user`, `collection`, and `schema_name` are implicit from the Qdrant collection name. #### Query Flow 1. User queries for "Chestnut Street" within user U, collection X, schema Y 2. Embed the query text 3. Determine Qdrant collection name(s) matching prefix `rows_U_X_Y_` 4. Search matching Qdrant collection(s) for nearest vectors 5. Get matching points with payloads containing `index_name` and `index_value` 6. Query Cassandra: ```sql SELECT * FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '' AND index_value = ``` 7. Return matched rows #### Optional: Filtering by Index Name Queries can optionally filter by `index_name` in Qdrant to search only specific fields: - **"Find any field matching 'Chestnut'"** → search all vectors in the collection - **"Find street_name matching 'Chestnut'"** → filter where `payload.index_name = 'street_name'` #### Architecture Row embeddings follow the **two-stage pattern** used by GraphRAG (graph-embeddings, document-embeddings): - **Stage 1: Embedding computation** (`trustgraph-flow/trustgraph/embeddings/row_embeddings/`) - Consumes `ExtractedObject`, computes embeddings via the embeddings service, outputs `RowEmbeddings` - **Stage 2: Embedding storage** (`trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/`) - Consumes `RowEmbeddings`, writes vectors to Qdrant The Cassandra row writer is a separate parallel consumer: - **Cassandra row writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra`) - Consumes `ExtractedObject`, writes rows to Cassandra All three services consume from the same flow, keeping them decoupled. This allows: - Independent scaling of Cassandra writes vs embedding generation vs vector storage - Embedding services can be disabled if not needed - Failures in one service don't affect the others - Consistent architecture with GraphRAG pipelines #### Write Path **Stage 1 (row-embeddings processor):** When receiving an `ExtractedObject`: 1. Look up the schema to find indexed fields 2. For each indexed field: - Build the text representation of the index value - Compute embedding via the embeddings service 3. Output a `RowEmbeddings` message containing all computed vectors **Stage 2 (row-embeddings-write-qdrant):** When receiving a `RowEmbeddings`: 1. For each embedding in the message: - Determine Qdrant collection from `(user, collection, schema_name, dimension)` - Create collection if needed (lazy creation on first write) - Upsert point with vector and payload #### Message Types ```python @dataclass class RowIndexEmbedding: index_name: str # The indexed field name(s) index_value: list[str] # The field value(s) text: str # Text that was embedded vectors: list[list[float]] # Computed embedding vectors @dataclass class RowEmbeddings: metadata: Metadata schema_name: str embeddings: list[RowIndexEmbedding] ``` #### Deletion Integration Qdrant collections are discovered by prefix matching on the collection name pattern: **Delete `(user, collection)`:** 1. List all Qdrant collections matching prefix `rows_{user}_{collection}_` 2. Delete each matching collection 3. Delete Cassandra rows partitions (as documented above) 4. Clean up `row_partitions` entries **Delete `(user, collection, schema_name)`:** 1. List all Qdrant collections matching prefix `rows_{user}_{collection}_{schema_name}_` 2. Delete each matching collection (handles multiple dimensions) 3. Delete Cassandra rows partitions 4. Clean up `row_partitions` #### Module Locations | Stage | Module | Entry Point | |-------|--------|-------------| | Stage 1 | `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | `row-embeddings` | | Stage 2 | `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | `row-embeddings-write-qdrant` | ### Row Embeddings Query API The row embeddings query is a **separate API** from the GraphQL row query service: | API | Purpose | Backend | |-----|---------|---------| | Row Query (GraphQL) | Exact matching on indexed fields | Cassandra | | Row Embeddings Query | Fuzzy/semantic matching | Qdrant | This separation keeps concerns clean: - GraphQL service focuses on exact, structured queries - Embeddings API handles semantic similarity - User workflow: fuzzy search via embeddings to find candidates, then exact query to get full row data #### Request/Response Schema ```python @dataclass class RowEmbeddingsRequest: vectors: list[list[float]] # Query vectors (pre-computed embeddings) user: str = "" collection: str = "" schema_name: str = "" index_name: str = "" # Optional: filter to specific index limit: int = 10 # Max results per vector @dataclass class RowIndexMatch: index_name: str = "" # The matched index field(s) index_value: list[str] = [] # The matched value(s) text: str = "" # Original text that was embedded score: float = 0.0 # Similarity score @dataclass class RowEmbeddingsResponse: error: Error | None = None matches: list[RowIndexMatch] = [] ``` #### Query Processor Module: `trustgraph-flow/trustgraph/query/row_embeddings/qdrant` Entry point: `row-embeddings-query-qdrant` The processor: 1. Receives `RowEmbeddingsRequest` with query vectors 2. Finds the appropriate Qdrant collection by prefix matching 3. Searches for nearest vectors with optional `index_name` filter 4. Returns `RowEmbeddingsResponse` with matching index information #### API Gateway Integration The gateway exposes row embeddings queries via the standard request/response pattern: | Component | Location | |-----------|----------| | Dispatcher | `trustgraph-flow/trustgraph/gateway/dispatch/row_embeddings_query.py` | | Registration | Add `"row-embeddings"` to `request_response_dispatchers` in `manager.py` | Flow interface name: `row-embeddings` Interface definition in flow blueprint: ```json { "interfaces": { "row-embeddings": { "request": "non-persistent://tg/request/row-embeddings:{id}", "response": "non-persistent://tg/response/row-embeddings:{id}" } } } ``` #### Python SDK Support The SDK provides methods for row embeddings queries: ```python # Flow-scoped query (preferred) api = Api(url) flow = api.flow().id("default") # Query with text (SDK computes embeddings) matches = flow.row_embeddings_query( text="Chestnut Street", collection="my_collection", schema_name="addresses", index_name="street_name", # Optional filter limit=10 ) # Query with pre-computed vectors matches = flow.row_embeddings_query( vectors=[[0.1, 0.2, ...]], collection="my_collection", schema_name="addresses" ) # Each match contains: for match in matches: print(match.index_name) # e.g., "street_name" print(match.index_value) # e.g., ["CHESTNUT ST"] print(match.text) # e.g., "CHESTNUT ST" print(match.score) # e.g., 0.95 ``` #### CLI Utility Command: `tg-invoke-row-embeddings` ```bash # Query by text (computes embedding automatically) tg-invoke-row-embeddings \ --text "Chestnut Street" \ --collection my_collection \ --schema addresses \ --index street_name \ --limit 10 # Query by vector file tg-invoke-row-embeddings \ --vectors vectors.json \ --collection my_collection \ --schema addresses # Output formats tg-invoke-row-embeddings --text "..." --format json tg-invoke-row-embeddings --text "..." --format table ``` #### Typical Usage Pattern The row embeddings query is typically used as part of a fuzzy-to-exact lookup flow: ```python # Step 1: Fuzzy search via embeddings matches = flow.row_embeddings_query( text="chestnut street", collection="geo", schema_name="streets" ) # Step 2: Exact lookup via GraphQL for full row data for match in matches: query = f''' query {{ streets(where: {{ {match.index_name}: {{ eq: "{match.index_value[0]}" }} }}) {{ street_name city zip_code }} }} ''' rows = flow.rows_query(query, collection="geo") ``` This two-step pattern enables: - Finding "CHESTNUT ST" when user searches for "Chestnut Street" - Retrieving complete row data with all fields - Combining semantic similarity with structured data access ### Row Data Ingestion Deferred to a subsequent phase. Will be designed alongside other ingestion changes. ## Implementation Impact ### Current State Analysis The existing implementation has two main components: | Component | Location | Lines | Description | |-----------|----------|-------|-------------| | Query Service | `trustgraph-flow/trustgraph/query/objects/cassandra/service.py` | ~740 | Monolithic: GraphQL schema generation, filter parsing, Cassandra queries, request handling | | Writer | `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py` | ~540 | Per-schema table creation, secondary indexes, insert/delete | **Current Query Pattern:** ```sql SELECT * FROM {keyspace}.o_{schema_name} WHERE collection = 'X' AND email = 'foo@bar.com' ALLOW FILTERING ``` **New Query Pattern:** ```sql SELECT * FROM {keyspace}.rows WHERE collection = 'X' AND schema_name = 'customers' AND index_name = 'email' AND index_value = ['foo@bar.com'] ``` ### Key Changes 1. **Query semantics simplify**: The new schema only supports exact matches on `index_value`. The current GraphQL filters (`gt`, `lt`, `contains`, etc.) either: - Become post-filtering on returned data (if still needed) - Are removed in favor of using the embeddings API for fuzzy matching 2. **GraphQL code is tightly coupled**: The current `service.py` bundles Strawberry type generation, filter parsing, and Cassandra-specific queries. Adding another row store backend would duplicate ~400 lines of GraphQL code. ### Proposed Refactor The refactor has two parts: #### 1. Break Out GraphQL Code Extract reusable GraphQL components into a shared module: ``` trustgraph-flow/trustgraph/query/graphql/ ├── __init__.py ├── types.py # Filter types (IntFilter, StringFilter, FloatFilter) ├── schema.py # Dynamic schema generation from RowSchema └── filters.py # Filter parsing utilities ``` This enables: - Reuse across different row store backends - Cleaner separation of concerns - Easier testing of GraphQL logic independently #### 2. Implement New Table Schema Refactor the Cassandra-specific code to use the unified table: **Writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra/`): - Single `rows` table instead of per-schema tables - Write N copies per row (one per index) - Register to `row_partitions` table - Simpler table creation (one-time setup) **Query Service** (`trustgraph-flow/trustgraph/query/rows/cassandra/`): - Query the unified `rows` table - Use extracted GraphQL module for schema generation - Simplified filter handling (exact match only at DB level) ### Module Renames As part of the "object" → "row" naming cleanup: | Current | New | |---------|-----| | `storage/objects/cassandra/` | `storage/rows/cassandra/` | | `query/objects/cassandra/` | `query/rows/cassandra/` | | `embeddings/object_embeddings/` | `embeddings/row_embeddings/` | ### New Modules | Module | Purpose | |--------|---------| | `trustgraph-flow/trustgraph/query/graphql/` | Shared GraphQL utilities | | `trustgraph-flow/trustgraph/query/row_embeddings/qdrant/` | Row embeddings query API | | `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | Row embeddings computation (Stage 1) | | `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | Row embeddings storage (Stage 2) | ## References - [Structured Data Technical Specification](structured-data.md)