---
layout: default
title: "Structured Data Technical Specification (Part 2)"
parent: "Tech Specs"
---
# Structured Data Technical Specification (Part 2)
## Overview
This specification addresses issues and gaps identified during the initial implementation of TrustGraph's structured data integration, as described in `structured-data.md`.
## Problem Statements
### 1. Naming Inconsistency: "Object" vs "Row"
The current implementation uses "object" terminology throughout (e.g., `ExtractedObject`, object extraction, object embeddings). This naming is too generic and causes confusion:
- "Object" is an overloaded term in software (Python objects, JSON objects, etc.)
- The data being handled is fundamentally tabular - rows in tables with defined schemas
- "Row" more accurately describes the data model and aligns with database terminology
This inconsistency appears in module names, class names, message types, and documentation.
### 2. Row Store Query Limitations
The current row store implementation has significant query limitations:
**Natural Language Mismatch**: Queries struggle with real-world data variations. For example:
- A street database containing `"CHESTNUT ST"` is difficult to find when asking about `"Chestnut Street"`
- Abbreviations, case differences, and formatting variations break exact-match queries
- Users expect semantic understanding, but the store provides literal matching
**Schema Evolution Issues**: Changing schemas causes problems:
- Existing data may not conform to updated schemas
- Table structure changes can break queries and data integrity
- No clear migration path for schema updates
### 3. Row Embeddings Required
Related to problem 2, the system needs vector embeddings for row data to enable:
- Semantic search across structured data (finding "Chestnut Street" when data contains "CHESTNUT ST")
- Similarity matching for fuzzy queries
- Hybrid search combining structured filters with semantic similarity
- Better natural language query support
The embedding service was specified but not implemented.
### 4. Row Data Ingestion Incomplete
The structured data ingestion pipeline is not fully operational:
- Diagnostic prompts exist to classify input formats (CSV, JSON, etc.)
- The ingestion service that uses these prompts is not plumbed into the system
- No end-to-end path for loading pre-structured data into the row store
## Goals
- **Schema Flexibility**: Enable schema evolution without breaking existing data or requiring migrations
- **Consistent Naming**: Standardize on "row" terminology throughout the codebase
- **Semantic Queryability**: Support fuzzy/semantic matching via row embeddings
- **Complete Ingestion Pipeline**: Provide end-to-end path for loading structured data
## Technical Design
### Unified Row Storage Schema
The previous implementation created a separate Cassandra table for each schema. This caused problems when schemas evolved, as table structure changes required migrations.
The new design uses a single unified table for all row data:
```sql
CREATE TABLE rows (
    collection text,
    schema_name text,
    index_name text,
    index_value frozen<list<text>>,
    data map<text, text>,
    source text,
    PRIMARY KEY ((collection, schema_name, index_name), index_value)
)
```
#### Column Definitions
| Column | Type | Description |
|--------|------|-------------|
| `collection` | `text` | Data collection/import identifier (from metadata) |
| `schema_name` | `text` | Name of the schema this row conforms to |
| `index_name` | `text` | Name of the indexed field(s), comma-joined for composites |
| `index_value` | `frozen<list<text>>` | Index value(s) as a list |
| `data` | `map<text, text>` | Row data as key-value pairs |
| `source` | `text` | Optional URI linking to provenance information in the knowledge graph. Empty string or NULL indicates no source. |
#### Index Handling
Each row is stored multiple times - once per indexed field defined in the schema. The primary key fields are treated as an index with no special marker, providing future flexibility.
**Single-field index example:**
- Schema defines `email` as indexed
- `index_name = "email"`
- `index_value = ['foo@bar.com']`
**Composite index example:**
- Schema defines composite index on `region` and `status`
- `index_name = "region,status"` (field names sorted and comma-joined)
- `index_value = ['US', 'active']` (values in same order as field names)
**Primary key example:**
- Schema defines `customer_id` as primary key
- `index_name = "customer_id"`
- `index_value = ['CUST001']`
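The expansion from one logical row into its N stored entries can be sketched as follows (`build_index_entries` is a hypothetical helper, not part of the codebase; it assumes a schema config expressed as a list of index field-name lists):

```python
# Hypothetical helper: expand one logical row into the N entries written
# to the unified `rows` table -- one per index defined in the schema.
def build_index_entries(row: dict[str, str], indexes: list[list[str]]):
    entries = []
    for fields in indexes:
        ordered = sorted(fields)                   # field names sorted
        entries.append({
            "index_name": ",".join(ordered),       # comma-joined for composites
            "index_value": [row[f] for f in ordered],  # values in same order
        })
    return entries

row = {"region": "US", "status": "active", "customer_id": "CUST001"}
indexes = [["customer_id"], ["status", "region"]]  # primary key + composite
entries = build_index_entries(row, indexes)
# entries[0] -> {"index_name": "customer_id", "index_value": ["CUST001"]}
# entries[1] -> {"index_name": "region,status", "index_value": ["US", "active"]}
```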
#### Query Patterns
All queries follow the same pattern regardless of which index is used:
```sql
SELECT * FROM rows
WHERE collection = 'import_2024'
AND schema_name = 'customers'
AND index_name = 'email'
AND index_value = ['foo@bar.com']
```
#### Design Trade-offs
**Advantages:**
- Schema changes don't require table structure changes
- Row data is opaque to Cassandra - field additions/removals are transparent
- Consistent query pattern for all access methods
- No Cassandra secondary indexes (which can be slow at scale)
- Native Cassandra types throughout (`map`, `frozen<list>`)
**Trade-offs:**
- Write amplification: each row insert = N inserts (one per indexed field)
- Storage overhead from duplicated row data
- Type information stored in schema config, conversion at application layer
#### Consistency Model
The design accepts certain simplifications:
1. **No row updates**: The system is append-only. This eliminates consistency concerns about updating multiple copies of the same row.
2. **Schema change tolerance**: When schemas change (e.g., indexes added/removed), existing rows retain their original indexing. Old rows won't be discoverable via new indexes. Users can delete and recreate a schema to ensure consistency if needed.
### Partition Tracking and Deletion
#### The Problem
With the partition key `(collection, schema_name, index_name)`, efficient deletion requires knowing all partition keys to delete. Deleting by just `collection` or `collection + schema_name` requires knowing all the `index_name` values that have data.
#### Partition Tracking Table
A secondary lookup table tracks which partitions exist:
```sql
CREATE TABLE row_partitions (
    collection text,
    schema_name text,
    index_name text,
    PRIMARY KEY ((collection), schema_name, index_name)
)
```
This enables efficient discovery of partitions for deletion operations.
#### Row Writer Behavior
The row writer maintains an in-memory cache of registered `(collection, schema_name)` pairs. When processing a row:
1. Check if `(collection, schema_name)` is in the cache
2. If not cached (first row for this pair):
- Look up the schema config to get all index names
- Insert entries into `row_partitions` for each `(collection, schema_name, index_name)`
- Add the pair to the cache
3. Proceed with writing the row data
The row writer also monitors schema config change events. When a schema changes, relevant cache entries are cleared so the next row triggers re-registration with the updated index names.
This approach ensures:
- Lookup table writes happen once per `(collection, schema_name)` pair, not per row
- The lookup table reflects the indexes that were active when data was written
- Schema changes mid-import are picked up correctly
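The cache behaviour above can be sketched roughly like this (`RowWriter`, the schema-config shape, and `register_partition` are illustrative stand-ins, not the actual implementation):

```python
# Sketch of the row writer's partition-registration cache.
class RowWriter:
    def __init__(self, schemas, register_partition):
        self.schemas = schemas                  # schema_name -> list of index names
        self.register_partition = register_partition
        self.registered = set()                 # cached (collection, schema_name) pairs

    def on_row(self, collection, schema_name, row):
        key = (collection, schema_name)
        if key not in self.registered:          # first row for this pair
            for index_name in self.schemas[schema_name]:
                self.register_partition(collection, schema_name, index_name)
            self.registered.add(key)
        # ... proceed with writing the row data (omitted) ...

    def on_schema_change(self, schema_name):
        # Clear cache entries so the next row re-registers updated indexes
        self.registered = {k for k in self.registered if k[1] != schema_name}
```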
#### Deletion Operations
**Delete collection:**
```sql
-- 1. Discover all partitions
SELECT schema_name, index_name FROM row_partitions WHERE collection = 'X';
-- 2. Delete each partition from rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = '...' AND index_name = '...';
-- (repeat for each discovered partition)
-- 3. Clean up the lookup table
DELETE FROM row_partitions WHERE collection = 'X';
```
**Delete collection + schema:**
```sql
-- 1. Discover partitions for this schema
SELECT index_name FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
-- 2. Delete each partition from rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '...';
-- (repeat for each discovered partition)
-- 3. Clean up the lookup table entries
DELETE FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
```
### Row Embeddings
Row embeddings enable semantic/fuzzy matching on indexed values, solving the natural language mismatch problem (e.g., finding "CHESTNUT ST" when querying for "Chestnut Street").
#### Design Overview
Each indexed value is embedded and stored in a vector store (Qdrant). At query time, the query is embedded, similar vectors are found, and the associated metadata is used to look up the actual rows in Cassandra.
#### Qdrant Collection Structure
One Qdrant collection per `(user, collection, schema_name, dimension)` tuple:
- **Collection naming:** `rows_{user}_{collection}_{schema_name}_{dimension}`
- Names are sanitized (non-alphanumeric characters replaced with `_`, lowercased, numeric prefixes get `r_` prefix)
- **Rationale:** Enables clean deletion of a `(user, collection, schema_name)` instance by dropping matching Qdrant collections; dimension suffix allows different embedding models to coexist
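A minimal sketch of the naming scheme, assuming the sanitization rules are exactly as summarized above (the real implementation may differ in edge cases):

```python
import re

# Illustrative sanitization: non-alphanumerics -> '_', lowercase,
# 'r_' prefix when the sanitized name starts with a digit.
def qdrant_collection_name(user, collection, schema_name, dimension):
    def sanitize(s):
        s = re.sub(r"[^a-zA-Z0-9]", "_", s).lower()
        if s and s[0].isdigit():
            s = "r_" + s
        return s
    parts = [sanitize(p) for p in (user, collection, schema_name)]
    return f"rows_{parts[0]}_{parts[1]}_{parts[2]}_{dimension}"

qdrant_collection_name("alice", "2024-geo", "streets", 384)
# -> "rows_alice_r_2024_geo_streets_384"
```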
#### What Gets Embedded
The text representation of index values:
| Index Type | Example `index_value` | Text to Embed |
|------------|----------------------|---------------|
| Single-field | `['foo@bar.com']` | `"foo@bar.com"` |
| Composite | `['US', 'active']` | `"US active"` (space-joined) |
#### Point Structure
Each Qdrant point contains:
```json
{
    "id": "<uuid>",
    "vector": [0.1, 0.2, ...],
    "payload": {
        "index_name": "street_name",
        "index_value": ["CHESTNUT ST"],
        "text": "CHESTNUT ST"
    }
}
```
| Payload Field | Description |
|---------------|-------------|
| `index_name` | The indexed field(s) this embedding represents |
| `index_value` | The original list of values (for Cassandra lookup) |
| `text` | The text that was embedded (for debugging/display) |
Note: `user`, `collection`, and `schema_name` are implicit from the Qdrant collection name.
#### Query Flow
1. User queries for "Chestnut Street" within user U, collection X, schema Y
2. Embed the query text
3. Determine Qdrant collection name(s) matching prefix `rows_U_X_Y_`
4. Search matching Qdrant collection(s) for nearest vectors
5. Get matching points with payloads containing `index_name` and `index_value`
6. Query Cassandra:
```sql
SELECT * FROM rows
WHERE collection = 'X'
AND schema_name = 'Y'
AND index_name = '<from payload>'
AND index_value = <from payload>
```
7. Return matched rows
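The flow above can be sketched with the backend calls injected as callables (`embed`, `search_qdrant`, and `query_cassandra` are hypothetical stand-ins for the embeddings service, the Qdrant search, and the Cassandra lookup):

```python
# End-to-end semantic row query, steps numbered as in the flow above.
def semantic_row_query(text, user, collection, schema_name,
                       embed, search_qdrant, query_cassandra, limit=10):
    vector = embed(text)                                  # step 2: embed query
    # step 3: derive the Qdrant collection name (dimension from the vector)
    qdrant_collection = f"rows_{user}_{collection}_{schema_name}_{len(vector)}"
    matches = search_qdrant(qdrant_collection, vector, limit)  # steps 4-5
    rows = []
    for m in matches:                                     # step 6: exact lookup
        rows.extend(query_cassandra(
            collection, schema_name,
            m["payload"]["index_name"], m["payload"]["index_value"],
        ))
    return rows                                           # step 7
```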
#### Optional: Filtering by Index Name
Queries can optionally filter by `index_name` in Qdrant to search only specific fields:
- **"Find any field matching 'Chestnut'"** → search all vectors in the collection
- **"Find street_name matching 'Chestnut'"** → filter where `payload.index_name = 'street_name'`
#### Architecture
Row embeddings follow the **two-stage pattern** used by GraphRAG (graph-embeddings, document-embeddings):
- **Stage 1: Embedding computation** (`trustgraph-flow/trustgraph/embeddings/row_embeddings/`) - Consumes `ExtractedObject`, computes embeddings via the embeddings service, outputs `RowEmbeddings`
- **Stage 2: Embedding storage** (`trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/`) - Consumes `RowEmbeddings`, writes vectors to Qdrant
The Cassandra row writer is a separate parallel consumer:
- **Cassandra row writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra`) - Consumes `ExtractedObject`, writes rows to Cassandra
All three services consume from the same flow, keeping them decoupled. This allows:
- Independent scaling of Cassandra writes vs embedding generation vs vector storage
- Embedding services can be disabled if not needed
- Failures in one service don't affect the others
- Consistent architecture with GraphRAG pipelines
#### Write Path
**Stage 1 (row-embeddings processor):** When receiving an `ExtractedObject`:
1. Look up the schema to find indexed fields
2. For each indexed field:
- Build the text representation of the index value
- Compute embedding via the embeddings service
3. Output a `RowEmbeddings` message containing all computed vectors
**Stage 2 (row-embeddings-write-qdrant):** When receiving a `RowEmbeddings`:
1. For each embedding in the message:
- Determine Qdrant collection from `(user, collection, schema_name, dimension)`
- Create collection if needed (lazy creation on first write)
- Upsert point with vector and payload
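Stage 1's per-field work can be sketched as a pure function (`embed` stands in for the embeddings service; the dict shape mirrors the `RowIndexEmbedding` message type):

```python
# Build the text for each indexed field and compute its embedding.
def compute_row_embeddings(row, indexes, embed):
    out = []
    for fields in indexes:                    # one entry per indexed field (set)
        ordered = sorted(fields)              # same ordering rule as the row store
        values = [row[f] for f in ordered]
        text = " ".join(values)               # space-joined for composites
        out.append({
            "index_name": ",".join(ordered),
            "index_value": values,
            "text": text,
            "vectors": [embed(text)],
        })
    return out
```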
#### Message Types
```python
from dataclasses import dataclass

@dataclass
class RowIndexEmbedding:
    index_name: str             # The indexed field name(s)
    index_value: list[str]      # The field value(s)
    text: str                   # Text that was embedded
    vectors: list[list[float]]  # Computed embedding vectors

@dataclass
class RowEmbeddings:
    metadata: Metadata
    schema_name: str
    embeddings: list[RowIndexEmbedding]
```
#### Deletion Integration
Qdrant collections are discovered by prefix matching on the collection name pattern:
**Delete `(user, collection)`:**
1. List all Qdrant collections matching prefix `rows_{user}_{collection}_`
2. Delete each matching collection
3. Delete Cassandra rows partitions (as documented above)
4. Clean up `row_partitions` entries
**Delete `(user, collection, schema_name)`:**
1. List all Qdrant collections matching prefix `rows_{user}_{collection}_{schema_name}_`
2. Delete each matching collection (handles multiple dimensions)
3. Delete Cassandra rows partitions
4. Clean up `row_partitions`
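A sketch of the prefix-based deletion, over a hypothetical client exposing `list_collections()` and `delete_collection()` (illustrative names, not the exact qdrant-client API; assumes `user`, `collection`, and `schema_name` are already sanitized):

```python
# Delete every Qdrant collection whose name matches the prefix for
# (user, collection) or, when schema_name is given, the narrower prefix
# for (user, collection, schema_name) -- covering all dimensions.
def delete_row_embeddings(client, user, collection, schema_name=None):
    prefix = f"rows_{user}_{collection}_"
    if schema_name is not None:
        prefix += f"{schema_name}_"
    for name in list(client.list_collections()):
        if name.startswith(prefix):
            client.delete_collection(name)
```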
#### Module Locations
| Stage | Module | Entry Point |
|-------|--------|-------------|
| Stage 1 | `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | `row-embeddings` |
| Stage 2 | `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | `row-embeddings-write-qdrant` |
### Row Embeddings Query API
The row embeddings query is a **separate API** from the GraphQL row query service:
| API | Purpose | Backend |
|-----|---------|---------|
| Row Query (GraphQL) | Exact matching on indexed fields | Cassandra |
| Row Embeddings Query | Fuzzy/semantic matching | Qdrant |
This separation keeps concerns clean:
- GraphQL service focuses on exact, structured queries
- Embeddings API handles semantic similarity
- User workflow: fuzzy search via embeddings to find candidates, then exact query to get full row data
#### Request/Response Schema
```python
from dataclasses import dataclass, field

@dataclass
class RowEmbeddingsRequest:
    vectors: list[list[float]]  # Query vectors (pre-computed embeddings)
    user: str = ""
    collection: str = ""
    schema_name: str = ""
    index_name: str = ""        # Optional: filter to specific index
    limit: int = 10             # Max results per vector

@dataclass
class RowIndexMatch:
    index_name: str = ""        # The matched index field(s)
    index_value: list[str] = field(default_factory=list)  # The matched value(s)
    text: str = ""              # Original text that was embedded
    score: float = 0.0          # Similarity score

@dataclass
class RowEmbeddingsResponse:
    error: Error | None = None
    matches: list[RowIndexMatch] = field(default_factory=list)
```
#### Query Processor
Module: `trustgraph-flow/trustgraph/query/row_embeddings/qdrant`
Entry point: `row-embeddings-query-qdrant`
The processor:
1. Receives `RowEmbeddingsRequest` with query vectors
2. Finds the appropriate Qdrant collection by prefix matching
3. Searches for nearest vectors with optional `index_name` filter
4. Returns `RowEmbeddingsResponse` with matching index information
#### API Gateway Integration
The gateway exposes row embeddings queries via the standard request/response pattern:
| Component | Location |
|-----------|----------|
| Dispatcher | `trustgraph-flow/trustgraph/gateway/dispatch/row_embeddings_query.py` |
| Registration | Add `"row-embeddings"` to `request_response_dispatchers` in `manager.py` |
Flow interface name: `row-embeddings`
Interface definition in flow blueprint:
```json
{
    "interfaces": {
        "row-embeddings": {
            "request": "non-persistent://tg/request/row-embeddings:{id}",
            "response": "non-persistent://tg/response/row-embeddings:{id}"
        }
    }
}
```
#### Python SDK Support
The SDK provides methods for row embeddings queries:
```python
# Flow-scoped query (preferred)
api = Api(url)
flow = api.flow().id("default")

# Query with text (SDK computes embeddings)
matches = flow.row_embeddings_query(
    text="Chestnut Street",
    collection="my_collection",
    schema_name="addresses",
    index_name="street_name",  # Optional filter
    limit=10
)

# Query with pre-computed vectors
matches = flow.row_embeddings_query(
    vectors=[[0.1, 0.2, ...]],
    collection="my_collection",
    schema_name="addresses"
)

# Each match contains:
for match in matches:
    print(match.index_name)   # e.g., "street_name"
    print(match.index_value)  # e.g., ["CHESTNUT ST"]
    print(match.text)         # e.g., "CHESTNUT ST"
    print(match.score)        # e.g., 0.95
```
#### CLI Utility
Command: `tg-invoke-row-embeddings`
```bash
# Query by text (computes embedding automatically)
tg-invoke-row-embeddings \
    --text "Chestnut Street" \
    --collection my_collection \
    --schema addresses \
    --index street_name \
    --limit 10

# Query by vector file
tg-invoke-row-embeddings \
    --vectors vectors.json \
    --collection my_collection \
    --schema addresses

# Output formats
tg-invoke-row-embeddings --text "..." --format json
tg-invoke-row-embeddings --text "..." --format table
```
#### Typical Usage Pattern
The row embeddings query is typically used as part of a fuzzy-to-exact lookup flow:
```python
# Step 1: Fuzzy search via embeddings
matches = flow.row_embeddings_query(
    text="chestnut street",
    collection="geo",
    schema_name="streets"
)

# Step 2: Exact lookup via GraphQL for full row data
for match in matches:
    query = f'''
    query {{
        streets(where: {{ {match.index_name}: {{ eq: "{match.index_value[0]}" }} }}) {{
            street_name
            city
            zip_code
        }}
    }}
    '''
    rows = flow.rows_query(query, collection="geo")
```
This two-step pattern enables:
- Finding "CHESTNUT ST" when user searches for "Chestnut Street"
- Retrieving complete row data with all fields
- Combining semantic similarity with structured data access
### Row Data Ingestion
Deferred to a subsequent phase. Will be designed alongside other ingestion changes.
## Implementation Impact
### Current State Analysis
The existing implementation has two main components:
| Component | Location | Lines | Description |
|-----------|----------|-------|-------------|
| Query Service | `trustgraph-flow/trustgraph/query/objects/cassandra/service.py` | ~740 | Monolithic: GraphQL schema generation, filter parsing, Cassandra queries, request handling |
| Writer | `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py` | ~540 | Per-schema table creation, secondary indexes, insert/delete |
**Current Query Pattern:**
```sql
SELECT * FROM {keyspace}.o_{schema_name}
WHERE collection = 'X' AND email = 'foo@bar.com'
ALLOW FILTERING
```
**New Query Pattern:**
```sql
SELECT * FROM {keyspace}.rows
WHERE collection = 'X' AND schema_name = 'customers'
AND index_name = 'email' AND index_value = ['foo@bar.com']
```
### Key Changes
1. **Query semantics simplify**: The new schema only supports exact matches on `index_value`. The current GraphQL filters (`gt`, `lt`, `contains`, etc.) either:
- Become post-filtering on returned data (if still needed)
- Are removed in favor of using the embeddings API for fuzzy matching
2. **GraphQL code is tightly coupled**: The current `service.py` bundles Strawberry type generation, filter parsing, and Cassandra-specific queries. Adding another row store backend would duplicate ~400 lines of GraphQL code.
### Proposed Refactor
The refactor has two parts:
#### 1. Break Out GraphQL Code
Extract reusable GraphQL components into a shared module:
```
trustgraph-flow/trustgraph/query/graphql/
├── __init__.py
├── types.py # Filter types (IntFilter, StringFilter, FloatFilter)
├── schema.py # Dynamic schema generation from RowSchema
└── filters.py # Filter parsing utilities
```
This enables:
- Reuse across different row store backends
- Cleaner separation of concerns
- Easier testing of GraphQL logic independently
#### 2. Implement New Table Schema
Refactor the Cassandra-specific code to use the unified table:
**Writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra/`):
- Single `rows` table instead of per-schema tables
- Write N copies per row (one per index)
- Register to `row_partitions` table
- Simpler table creation (one-time setup)
**Query Service** (`trustgraph-flow/trustgraph/query/rows/cassandra/`):
- Query the unified `rows` table
- Use extracted GraphQL module for schema generation
- Simplified filter handling (exact match only at DB level)
### Module Renames
As part of the "object" → "row" naming cleanup:
| Current | New |
|---------|-----|
| `storage/objects/cassandra/` | `storage/rows/cassandra/` |
| `query/objects/cassandra/` | `query/rows/cassandra/` |
| `embeddings/object_embeddings/` | `embeddings/row_embeddings/` |
### New Modules
| Module | Purpose |
|--------|---------|
| `trustgraph-flow/trustgraph/query/graphql/` | Shared GraphQL utilities |
| `trustgraph-flow/trustgraph/query/row_embeddings/qdrant/` | Row embeddings query API |
| `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | Row embeddings computation (Stage 1) |
| `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | Row embeddings storage (Stage 2) |
## References
- [Structured Data Technical Specification](structured-data.md)