Structured Data Technical Specification (Part 2)
Overview
This specification addresses issues and gaps identified during the initial implementation of TrustGraph's structured data integration, as described in structured-data.md.
Problem Statements
1. Naming Inconsistency: "Object" vs "Row"
The current implementation uses "object" terminology throughout (e.g., ExtractedObject, object extraction, object embeddings). This naming is too generic and causes confusion:
- "Object" is an overloaded term in software (Python objects, JSON objects, etc.)
- The data being handled is fundamentally tabular - rows in tables with defined schemas
- "Row" more accurately describes the data model and aligns with database terminology
This inconsistency appears in module names, class names, message types, and documentation.
2. Row Store Query Limitations
The current row store implementation has significant query limitations:
Natural Language Mismatch: Queries struggle with real-world data variations. For example:
- A street database containing "CHESTNUT ST" is difficult to find when asking about "Chestnut Street"
- Abbreviations, case differences, and formatting variations break exact-match queries
- Users expect semantic understanding, but the store provides literal matching
Schema Evolution Issues: Changing schemas causes problems:
- Existing data may not conform to updated schemas
- Table structure changes can break queries and data integrity
- No clear migration path for schema updates
3. Row Embeddings Required
Related to problem 2, the system needs vector embeddings for row data to enable:
- Semantic search across structured data (finding "Chestnut Street" when data contains "CHESTNUT ST")
- Similarity matching for fuzzy queries
- Hybrid search combining structured filters with semantic similarity
- Better natural language query support
The embedding service was specified but not implemented.
4. Row Data Ingestion Incomplete
The structured data ingestion pipeline is not fully operational:
- Diagnostic prompts exist to classify input formats (CSV, JSON, etc.)
- The ingestion service that uses these prompts is not plumbed into the system
- No end-to-end path for loading pre-structured data into the row store
Goals
- Schema Flexibility: Enable schema evolution without breaking existing data or requiring migrations
- Consistent Naming: Standardize on "row" terminology throughout the codebase
- Semantic Queryability: Support fuzzy/semantic matching via row embeddings
- Complete Ingestion Pipeline: Provide end-to-end path for loading structured data
Technical Design
Unified Row Storage Schema
The previous implementation created a separate Cassandra table for each schema. This caused problems when schemas evolved, as table structure changes required migrations.
The new design uses a single unified table for all row data:
CREATE TABLE rows (
collection text,
schema_name text,
index_name text,
index_value frozen<list<text>>,
data map<text, text>,
source text,
PRIMARY KEY ((collection, schema_name, index_name), index_value)
)
Column Definitions
| Column | Type | Description |
|---|---|---|
| collection | text | Data collection/import identifier (from metadata) |
| schema_name | text | Name of the schema this row conforms to |
| index_name | text | Name of the indexed field(s), comma-joined for composites |
| index_value | frozen<list<text>> | Index value(s) as a list |
| data | map<text, text> | Row data as key-value pairs |
| source | text | Optional URI linking to provenance information in the knowledge graph. Empty string or NULL indicates no source. |
Index Handling
Each row is stored multiple times - once per indexed field defined in the schema. The primary key fields are treated as an index with no special marker, providing future flexibility.
Single-field index example:
- Schema defines email as indexed
- index_name = "email"
- index_value = ['foo@bar.com']

Composite index example:
- Schema defines a composite index on region and status
- index_name = "region,status" (field names sorted and comma-joined)
- index_value = ['US', 'active'] (values in the same order as the field names)

Primary key example:
- Schema defines customer_id as the primary key
- index_name = "customer_id"
- index_value = ['CUST001']
Query Patterns
All queries follow the same pattern regardless of which index is used:
SELECT * FROM rows
WHERE collection = 'import_2024'
AND schema_name = 'customers'
AND index_name = 'email'
AND index_value = ['foo@bar.com']
Design Trade-offs
Advantages:
- Schema changes don't require table structure changes
- Row data is opaque to Cassandra - field additions/removals are transparent
- Consistent query pattern for all access methods
- No Cassandra secondary indexes (which can be slow at scale)
- Native Cassandra types throughout (map, frozen<list<text>>)
Trade-offs:
- Write amplification: each row insert = N inserts (one per indexed field)
- Storage overhead from duplicated row data
- Type information stored in schema config, conversion at application layer
Consistency Model
The design accepts certain simplifications:
- No row updates: The system is append-only. This eliminates consistency concerns about updating multiple copies of the same row.
- Schema change tolerance: When schemas change (e.g., indexes added/removed), existing rows retain their original indexing. Old rows won't be discoverable via new indexes. Users can delete and recreate a schema to ensure consistency if needed.
Partition Tracking and Deletion
The Problem
With the partition key (collection, schema_name, index_name), efficient deletion requires knowing all partition keys to delete. Deleting by just collection or collection + schema_name requires knowing all the index_name values that have data.
Partition Tracking Table
A secondary lookup table tracks which partitions exist:
CREATE TABLE row_partitions (
collection text,
schema_name text,
index_name text,
PRIMARY KEY ((collection), schema_name, index_name)
)
This enables efficient discovery of partitions for deletion operations.
Row Writer Behavior
The row writer maintains an in-memory cache of registered (collection, schema_name) pairs. When processing a row:
- Check if (collection, schema_name) is in the cache
- If not cached (first row for this pair):
  - Look up the schema config to get all index names
  - Insert entries into row_partitions for each (collection, schema_name, index_name)
  - Add the pair to the cache
- Proceed with writing the row data
The row writer also monitors schema config change events. When a schema changes, relevant cache entries are cleared so the next row triggers re-registration with the updated index names.
This approach ensures:
- Lookup table writes happen once per (collection, schema_name) pair, not per row
- The lookup table reflects the indexes that were active when the data was written
- Schema changes mid-import are picked up correctly
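The registration logic above, as a minimal sketch (get_schema_config, insert_partition_entry, and insert_row_copies are hypothetical helpers):

registered = set()  # cached (collection, schema_name) pairs

def write_row(collection, schema_name, row):
    key = (collection, schema_name)
    if key not in registered:
        # First row for this pair: register its partitions
        schema = get_schema_config(schema_name)  # assumed config lookup
        for index_name in schema.index_names:    # assumed attribute
            insert_partition_entry(collection, schema_name, index_name)
        registered.add(key)
    insert_row_copies(collection, schema_name, row)

def on_schema_config_change(schema_name):
    # Clear cache entries so the next row re-registers updated indexes
    for key in [k for k in registered if k[1] == schema_name]:
        registered.discard(key)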
Deletion Operations
Delete collection:
-- 1. Discover all partitions
SELECT schema_name, index_name FROM row_partitions WHERE collection = 'X';
-- 2. Delete each partition from rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = '...' AND index_name = '...';
-- (repeat for each discovered partition)
-- 3. Clean up the lookup table
DELETE FROM row_partitions WHERE collection = 'X';
Delete collection + schema:
-- 1. Discover partitions for this schema
SELECT index_name FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
-- 2. Delete each partition from rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '...';
-- (repeat for each discovered partition)
-- 3. Clean up the lookup table entries
DELETE FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
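Collection deletion expressed against the cassandra-driver Session API, as a sketch (`session` is assumed to be an open connection to the keyspace):

def delete_collection(session, collection):
    # 1. Discover all partitions
    partitions = list(session.execute(
        "SELECT schema_name, index_name FROM row_partitions WHERE collection = %s",
        (collection,)))
    # 2. Delete each partition from the rows table
    for p in partitions:
        session.execute(
            "DELETE FROM rows WHERE collection = %s AND schema_name = %s AND index_name = %s",
            (collection, p.schema_name, p.index_name))
    # 3. Clean up the lookup table
    session.execute(
        "DELETE FROM row_partitions WHERE collection = %s", (collection,))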
Row Embeddings
Row embeddings enable semantic/fuzzy matching on indexed values, solving the natural language mismatch problem (e.g., finding "CHESTNUT ST" when querying for "Chestnut Street").
Design Overview
Each indexed value is embedded and stored in a vector store (Qdrant). At query time, the query is embedded, similar vectors are found, and the associated metadata is used to look up the actual rows in Cassandra.
Qdrant Collection Structure
One Qdrant collection per (user, collection, schema_name, dimension) tuple:
- Collection naming: rows_{user}_{collection}_{schema_name}_{dimension}
- Names are sanitized (non-alphanumeric characters replaced with _, lowercased, names with a leading digit get an r_ prefix)
- Rationale: enables clean deletion of a (user, collection, schema_name) instance by dropping matching Qdrant collections; the dimension suffix allows different embedding models to coexist
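A sketch of the naming scheme; the exact sanitization rules here are an assumption based on the description above:

import re

def qdrant_collection_name(user, collection, schema_name, dimension):
    def sanitize(s):
        s = re.sub(r"[^a-zA-Z0-9]", "_", s).lower()  # replace non-alphanumerics, lowercase
        return "r_" + s if s[:1].isdigit() else s    # prefix names starting with a digit
    return "rows_%s_%s_%s_%d" % (
        sanitize(user), sanitize(collection), sanitize(schema_name), dimension)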
What Gets Embedded
The text representation of index values:
| Index Type | Example index_value | Text to Embed |
|---|---|---|
| Single-field | ['foo@bar.com'] | "foo@bar.com" |
| Composite | ['US', 'active'] | "US active" (space-joined) |
Point Structure
Each Qdrant point contains:
{
"id": "<uuid>",
"vector": [0.1, 0.2, ...],
"payload": {
"index_name": "street_name",
"index_value": ["CHESTNUT ST"],
"text": "CHESTNUT ST"
}
}
| Payload Field | Description |
|---|---|
| index_name | The indexed field(s) this embedding represents |
| index_value | The original list of values (for Cassandra lookup) |
| text | The text that was embedded (for debugging/display) |
Note: user, collection, and schema_name are implicit from the Qdrant collection name.
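For illustration, writing one such point with the qdrant-client library (the host, collection name, and embed() call are placeholders, not part of the specification):

from uuid import uuid4
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)  # placeholder endpoint

vector = embed("CHESTNUT ST")  # hypothetical embeddings-service call

client.upsert(
    collection_name="rows_u_geo_streets_384",  # placeholder collection
    points=[PointStruct(
        id=str(uuid4()),
        vector=vector,
        payload={
            "index_name": "street_name",
            "index_value": ["CHESTNUT ST"],
            "text": "CHESTNUT ST",
        },
    )],
)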
Query Flow
- User queries for "Chestnut Street" within user U, collection X, schema Y
- Embed the query text
- Determine the Qdrant collection name(s) matching prefix rows_U_X_Y_
- Search the matching Qdrant collection(s) for nearest vectors
- Get matching points with payloads containing index_name and index_value
- Query Cassandra: SELECT * FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '<from payload>' AND index_value = <from payload>
- Return matched rows
Optional: Filtering by Index Name
Queries can optionally filter by index_name in Qdrant to search only specific fields:
- "Find any field matching 'Chestnut'" → search all vectors in the collection
- "Find street_name matching 'Chestnut'" → filter where
payload.index_name = 'street_name'
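A sketch of the search and lookup steps with qdrant-client, including the optional index_name filter (embed() and the Cassandra session are assumptions):

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)  # placeholder endpoint

# Optional: restrict the search to a single indexed field
flt = Filter(must=[
    FieldCondition(key="index_name", match=MatchValue(value="street_name")),
])

hits = client.search(
    collection_name="rows_u_x_y_384",        # resolved by prefix matching
    query_vector=embed("Chestnut Street"),   # hypothetical embeddings call
    query_filter=flt,                        # pass None to search all fields
    limit=10,
)

# Look up the actual rows in Cassandra from each hit's payload
for hit in hits:
    rows = session.execute(
        "SELECT * FROM rows WHERE collection = %s AND schema_name = %s "
        "AND index_name = %s AND index_value = %s",
        ("X", "Y", hit.payload["index_name"], hit.payload["index_value"]))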
Architecture
Row embeddings follow the two-stage pattern used by GraphRAG (graph-embeddings, document-embeddings):
- Stage 1: Embedding computation (trustgraph-flow/trustgraph/embeddings/row_embeddings/) - consumes ExtractedObject, computes embeddings via the embeddings service, outputs RowEmbeddings
- Stage 2: Embedding storage (trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/) - consumes RowEmbeddings, writes vectors to Qdrant

The Cassandra row writer is a separate parallel consumer:
- Cassandra row writer (trustgraph-flow/trustgraph/storage/rows/cassandra/) - consumes ExtractedObject, writes rows to Cassandra
All three services consume from the same flow, keeping them decoupled. This allows:
- Independent scaling of Cassandra writes vs embedding generation vs vector storage
- Embedding services can be disabled if not needed
- Failures in one service don't affect the others
- Consistent architecture with GraphRAG pipelines
Write Path
Stage 1 (row-embeddings processor): When receiving an ExtractedObject:
- Look up the schema to find indexed fields
- For each indexed field:
  - Build the text representation of the index value
  - Compute the embedding via the embeddings service
- Output a RowEmbeddings message containing all computed vectors

Stage 2 (row-embeddings-write-qdrant): When receiving a RowEmbeddings:
- For each embedding in the message:
  - Determine the Qdrant collection from (user, collection, schema_name, dimension)
  - Create the collection if needed (lazy creation on first write)
  - Upsert the point with vector and payload
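A condensed sketch of the Stage 1 handler, assuming a hypothetical embeddings client, the index_entries() helper sketched earlier, and the message types defined in the next section:

async def on_extracted_object(obj, embeddings, producer):
    schema = get_schema_config(obj.schema_name)  # assumed config lookup
    out = []
    for index_name, index_value in index_entries(schema, obj.values):
        text = " ".join(index_value)            # space-join composite values
        vectors = await embeddings.embed(text)  # assumed embeddings call
        out.append(RowIndexEmbedding(index_name, index_value, text, vectors))
    await producer.send(RowEmbeddings(obj.metadata, obj.schema_name, out))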
Message Types
from dataclasses import dataclass

@dataclass
class RowIndexEmbedding:
    index_name: str             # The indexed field name(s)
    index_value: list[str]      # The field value(s)
    text: str                   # Text that was embedded
    vectors: list[list[float]]  # Computed embedding vectors

@dataclass
class RowEmbeddings:
    metadata: Metadata
    schema_name: str
    embeddings: list[RowIndexEmbedding]
Deletion Integration
Qdrant collections are discovered by prefix matching on the collection name pattern:
Delete (user, collection):
- List all Qdrant collections matching prefix rows_{user}_{collection}_
- Delete each matching collection
- Delete the Cassandra rows partitions (as documented above)
- Clean up row_partitions entries

Delete (user, collection, schema_name):
- List all Qdrant collections matching prefix rows_{user}_{collection}_{schema_name}_
- Delete each matching collection (handles multiple dimensions)
- Delete the Cassandra rows partitions
- Clean up row_partitions entries
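A sketch of the prefix-matching deletion with qdrant-client:

def delete_row_collections(client, prefix):
    # e.g. prefix = "rows_u_x_" or "rows_u_x_streets_"
    for c in client.get_collections().collections:
        if c.name.startswith(prefix):
            client.delete_collection(collection_name=c.name)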
Module Locations
| Stage | Module | Entry Point |
|---|---|---|
| Stage 1 | trustgraph-flow/trustgraph/embeddings/row_embeddings/ | row-embeddings |
| Stage 2 | trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/ | row-embeddings-write-qdrant |
Row Embeddings Query API
The row embeddings query is a separate API from the GraphQL row query service:
| API | Purpose | Backend |
|---|---|---|
| Row Query (GraphQL) | Exact matching on indexed fields | Cassandra |
| Row Embeddings Query | Fuzzy/semantic matching | Qdrant |
This separation keeps concerns clean:
- GraphQL service focuses on exact, structured queries
- Embeddings API handles semantic similarity
- User workflow: fuzzy search via embeddings to find candidates, then exact query to get full row data
Request/Response Schema
from dataclasses import dataclass, field

@dataclass
class RowEmbeddingsRequest:
    vectors: list[list[float]]  # Query vectors (pre-computed embeddings)
    user: str = ""
    collection: str = ""
    schema_name: str = ""
    index_name: str = ""        # Optional: filter to specific index
    limit: int = 10             # Max results per vector

@dataclass
class RowIndexMatch:
    index_name: str = ""        # The matched index field(s)
    index_value: list[str] = field(default_factory=list)  # The matched value(s)
    text: str = ""              # Original text that was embedded
    score: float = 0.0          # Similarity score

@dataclass
class RowEmbeddingsResponse:
    error: Error | None = None
    matches: list[RowIndexMatch] = field(default_factory=list)
Query Processor
Module: trustgraph-flow/trustgraph/query/row_embeddings/qdrant
Entry point: row-embeddings-query-qdrant
The processor:
- Receives RowEmbeddingsRequest with query vectors
- Finds the appropriate Qdrant collection by prefix matching
- Searches for nearest vectors with an optional index_name filter
- Returns RowEmbeddingsResponse with matching index information
API Gateway Integration
The gateway exposes row embeddings queries via the standard request/response pattern:
| Component | Location |
|---|---|
| Dispatcher | trustgraph-flow/trustgraph/gateway/dispatch/row_embeddings_query.py |
| Registration | Add "row-embeddings" to request_response_dispatchers in manager.py |
Flow interface name: row-embeddings
Interface definition in flow blueprint:
{
"interfaces": {
"row-embeddings": {
"request": "non-persistent://tg/request/row-embeddings:{id}",
"response": "non-persistent://tg/response/row-embeddings:{id}"
}
}
}
Python SDK Support
The SDK provides methods for row embeddings queries:
# Flow-scoped query (preferred)
api = Api(url)
flow = api.flow().id("default")
# Query with text (SDK computes embeddings)
matches = flow.row_embeddings_query(
text="Chestnut Street",
collection="my_collection",
schema_name="addresses",
index_name="street_name", # Optional filter
limit=10
)
# Query with pre-computed vectors
matches = flow.row_embeddings_query(
vectors=[[0.1, 0.2, ...]],
collection="my_collection",
schema_name="addresses"
)
# Each match contains:
for match in matches:
print(match.index_name) # e.g., "street_name"
print(match.index_value) # e.g., ["CHESTNUT ST"]
print(match.text) # e.g., "CHESTNUT ST"
print(match.score) # e.g., 0.95
CLI Utility
Command: tg-invoke-row-embeddings
# Query by text (computes embedding automatically)
tg-invoke-row-embeddings \
--text "Chestnut Street" \
--collection my_collection \
--schema addresses \
--index street_name \
--limit 10
# Query by vector file
tg-invoke-row-embeddings \
--vectors vectors.json \
--collection my_collection \
--schema addresses
# Output formats
tg-invoke-row-embeddings --text "..." --format json
tg-invoke-row-embeddings --text "..." --format table
Typical Usage Pattern
The row embeddings query is typically used as part of a fuzzy-to-exact lookup flow:
# Step 1: Fuzzy search via embeddings
matches = flow.row_embeddings_query(
text="chestnut street",
collection="geo",
schema_name="streets"
)
# Step 2: Exact lookup via GraphQL for full row data
for match in matches:
query = f'''
query {{
streets(where: {{ {match.index_name}: {{ eq: "{match.index_value[0]}" }} }}) {{
street_name
city
zip_code
}}
}}
'''
rows = flow.rows_query(query, collection="geo")
This two-step pattern enables:
- Finding "CHESTNUT ST" when user searches for "Chestnut Street"
- Retrieving complete row data with all fields
- Combining semantic similarity with structured data access
Row Data Ingestion
Deferred to a subsequent phase. Will be designed alongside other ingestion changes.
Implementation Impact
Current State Analysis
The existing implementation has two main components:
| Component | Location | Lines | Description |
|---|---|---|---|
| Query Service | trustgraph-flow/trustgraph/query/objects/cassandra/service.py | ~740 | Monolithic: GraphQL schema generation, filter parsing, Cassandra queries, request handling |
| Writer | trustgraph-flow/trustgraph/storage/objects/cassandra/write.py | ~540 | Per-schema table creation, secondary indexes, insert/delete |
Current Query Pattern:
SELECT * FROM {keyspace}.o_{schema_name}
WHERE collection = 'X' AND email = 'foo@bar.com'
ALLOW FILTERING
New Query Pattern:
SELECT * FROM {keyspace}.rows
WHERE collection = 'X' AND schema_name = 'customers'
AND index_name = 'email' AND index_value = ['foo@bar.com']
Key Changes
- Query semantics simplify: The new schema only supports exact matches on index_value. The current GraphQL filters (gt, lt, contains, etc.) either:
  - Become post-filtering on returned data (if still needed), as sketched below
  - Are removed in favor of using the embeddings API for fuzzy matching
- GraphQL code is tightly coupled: The current service.py bundles Strawberry type generation, filter parsing, and Cassandra-specific queries. Adding another row store backend would duplicate ~400 lines of GraphQL code.
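If post-filtering is retained, it could look like the following sketch (a hypothetical helper; row values come back as text because of the map<text, text> column):

def post_filter(rows, field, op, value):
    # Apply a comparison the unified table can no longer push down to Cassandra.
    def num(v):
        try:
            return float(v)
        except (TypeError, ValueError):
            return None
    out = []
    for r in rows:
        a, b = num(r["data"].get(field)), num(value)
        if op == "gt" and a is not None and b is not None and a > b:
            out.append(r)
        elif op == "lt" and a is not None and b is not None and a < b:
            out.append(r)
        elif op == "contains" and str(value) in str(r["data"].get(field, "")):
            out.append(r)
    return out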
Proposed Refactor
The refactor has two parts:
1. Break Out GraphQL Code
Extract reusable GraphQL components into a shared module:
trustgraph-flow/trustgraph/query/graphql/
├── __init__.py
├── types.py # Filter types (IntFilter, StringFilter, FloatFilter)
├── schema.py # Dynamic schema generation from RowSchema
└── filters.py # Filter parsing utilities
This enables:
- Reuse across different row store backends
- Cleaner separation of concerns
- Easier testing of GraphQL logic independently
2. Implement New Table Schema
Refactor the Cassandra-specific code to use the unified table:
Writer (trustgraph-flow/trustgraph/storage/rows/cassandra/):
- Single rows table instead of per-schema tables
- Write N copies per row (one per index)
- Register in the row_partitions table
- Simpler table creation (one-time setup)

Query Service (trustgraph-flow/trustgraph/query/rows/cassandra/):
- Query the unified rows table
- Use the extracted GraphQL module for schema generation
- Simplified filter handling (exact match only at the DB level)
Module Renames
As part of the "object" → "row" naming cleanup:
| Current | New |
|---|---|
| storage/objects/cassandra/ | storage/rows/cassandra/ |
| query/objects/cassandra/ | query/rows/cassandra/ |
| embeddings/object_embeddings/ | embeddings/row_embeddings/ |
New Modules
| Module | Purpose |
|---|---|
| trustgraph-flow/trustgraph/query/graphql/ | Shared GraphQL utilities |
| trustgraph-flow/trustgraph/query/row_embeddings/qdrant/ | Row embeddings query API |
| trustgraph-flow/trustgraph/embeddings/row_embeddings/ | Row embeddings computation (Stage 1) |
| trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/ | Row embeddings storage (Stage 2) |