Merge 2.0 to master (#651)

cybermaggedon 2026-02-28 11:03:14 +00:00 committed by GitHub
parent 3666ece2c5
commit b9d7bf9a8b
212 changed files with 13940 additions and 6180 deletions


@ -0,0 +1,260 @@
# Entity-Centric Knowledge Graph Storage on Cassandra
## Overview
This document describes a storage model for RDF-style knowledge graphs on Apache Cassandra. The model uses an **entity-centric** approach where every entity knows every quad it participates in and the role it plays. This replaces a traditional multi-table SPO permutation approach with just two tables.
## Background and Motivation
### The Traditional Approach
A standard RDF quad store on Cassandra requires multiple denormalised tables to cover query patterns — typically 6 or more tables representing different permutations of Subject, Predicate, Object, and Dataset (SPOD). Each quad is written to every table, resulting in significant write amplification, operational overhead, and schema complexity.
Additionally, label resolution (fetching human-readable names for entities) requires separate round-trip queries, which is particularly costly in AI and GraphRAG use cases where labels are essential for LLM context.
### The Entity-Centric Insight
Every quad `(D, S, P, O)` involves up to 4 entities. By writing a row for each entity's participation in the quad, we guarantee that **any query with at least one known element will hit a partition key**. This covers all 16 query patterns with a single data table.
Key benefits:
- **2 tables** instead of 7+
- **5 writes per quad** (4 entity rows + 1 manifest row) instead of 7+
- **Label resolution for free** — an entity's labels are co-located with its relationships, naturally warming the application cache
- **All 16 query patterns** served by single-partition reads
- **Simpler operations** — one data table to tune, compact, and repair
## Schema
### Table 1: quads_by_entity
The primary data table. Every entity has a partition containing all quads it participates in. Named to reflect the query pattern (lookup by entity).
```sql
CREATE TABLE quads_by_entity (
collection text, -- Collection/tenant scope (always specified)
entity text, -- The entity this row is about
role text, -- 'S', 'P', 'O', 'G' — how this entity participates
p text, -- Predicate of the quad
otype text, -- 'U' (URI), 'L' (literal), 'T' (triple/reification)
s text, -- Subject of the quad
o text, -- Object of the quad
d text, -- Dataset/graph of the quad
dtype text, -- XSD datatype (when otype = 'L'), e.g. 'xsd:string'
lang text, -- Language tag (when otype = 'L'), e.g. 'en', 'fr'
PRIMARY KEY ((collection, entity), role, p, otype, s, o, d)
);
```
**Partition key**: `(collection, entity)` — scoped to collection, one partition per entity.
**Clustering column order rationale**:
1. **role** — most queries start with "where is this entity a subject/object"
2. **p** — next most common filter, "give me all `knows` relationships"
3. **otype** — enables filtering by URI-valued vs literal-valued relationships
4. **s, o, d** — remaining columns for uniqueness
### Table 2: quads_by_collection
Supports collection-level queries and deletion. Provides a manifest of all quads belonging to a collection. Named to reflect the query pattern (lookup by collection).
```sql
CREATE TABLE quads_by_collection (
collection text,
d text, -- Dataset/graph of the quad
s text, -- Subject of the quad
p text, -- Predicate of the quad
o text, -- Object of the quad
otype text, -- 'U' (URI), 'L' (literal), 'T' (triple/reification)
dtype text, -- XSD datatype (when otype = 'L')
lang text, -- Language tag (when otype = 'L')
PRIMARY KEY (collection, d, s, p, o)
);
```
Clustered by dataset first, enabling deletion at either collection or dataset granularity.
## Write Path
For each incoming quad `(D, S, P, O)` within a collection `C`, write **4 rows** to `quads_by_entity` and **1 row** to `quads_by_collection`.
### Example
Given the quad in collection `tenant1`:
```
Dataset: https://example.org/graph1
Subject: https://example.org/Alice
Predicate: https://example.org/knows
Object: https://example.org/Bob
```
Write 4 rows to `quads_by_entity`:
| collection | entity | role | p | otype | s | o | d |
|---|---|---|---|---|---|---|---|
| tenant1 | https://example.org/graph1 | G | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
| tenant1 | https://example.org/Alice | S | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
| tenant1 | https://example.org/knows | P | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
| tenant1 | https://example.org/Bob | O | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
Write 1 row to `quads_by_collection`:
| collection | d | s | p | o | otype | dtype | lang |
|---|---|---|---|---|---|---|---|
| tenant1 | https://example.org/graph1 | https://example.org/Alice | https://example.org/knows | https://example.org/Bob | U | | |
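The fan-out logic is simple enough to sketch in a few lines of Python. This is illustrative only and assumes plain string terms; the repository's actual writer API may differ:
```python
def entity_rows(collection, d, s, p, o, otype="U", dtype="", lang=""):
    """Fan one quad out into the rows written to quads_by_entity:
    one row per participating entity (graph, subject, predicate and,
    for non-literal objects, the object)."""
    participants = [(d, "G"), (s, "S"), (p, "P")]
    if otype != "L":  # literals are not independently queryable entities
        participants.append((o, "O"))
    return [
        {
            "collection": collection, "entity": entity, "role": role,
            "p": p, "otype": otype, "s": s, "o": o, "d": d,
            "dtype": dtype, "lang": lang,
        }
        for entity, role in participants
    ]

def collection_row(collection, d, s, p, o, otype="U", dtype="", lang=""):
    """The single manifest row written to quads_by_collection."""
    return {
        "collection": collection, "d": d, "s": s, "p": p, "o": o,
        "otype": otype, "dtype": dtype, "lang": lang,
    }
```
For the `knows` quad above this produces exactly the four entity rows shown; for a literal-valued quad it produces three, as in the next example.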
### Literal Example
For a label triple:
```
Dataset: https://example.org/graph1
Subject: https://example.org/Alice
Predicate: http://www.w3.org/2000/01/rdf-schema#label
Object: "Alice Smith" (lang: en)
```
The `otype` is `'L'`, `dtype` is `'xsd:string'`, and `lang` is `'en'`. The literal value `"Alice Smith"` is stored in `o`. Only 3 rows are needed in `quads_by_entity` — no row is written for the literal as entity, since literals are not independently queryable entities.
## Query Patterns
### All 16 DSPO Patterns
In the table below, "Perfect prefix" means the query uses a contiguous prefix of the clustering columns. "Partition scan + filter" means Cassandra reads a slice of one partition and filters in memory — still efficient, just not a pure prefix match.
| # | Known | Lookup entity | Clustering prefix | Efficiency |
|---|---|---|---|---|
| 1 | D,S,P,O | entity=S, role='S', p=P | Full match | Perfect prefix |
| 2 | D,S,P,? | entity=S, role='S', p=P | Filter on D | Partition scan + filter |
| 3 | D,S,?,O | entity=S, role='S' | Filter on D, O | Partition scan + filter |
| 4 | D,?,P,O | entity=O, role='O', p=P | Filter on D | Partition scan + filter |
| 5 | ?,S,P,O | entity=S, role='S', p=P | Filter on O | Partition scan + filter |
| 6 | D,S,?,? | entity=S, role='S' | Filter on D | Partition scan + filter |
| 7 | D,?,P,? | entity=P, role='P' | Filter on D | Partition scan + filter |
| 8 | D,?,?,O | entity=O, role='O' | Filter on D | Partition scan + filter |
| 9 | ?,S,P,? | entity=S, role='S', p=P | — | **Perfect prefix** |
| 10 | ?,S,?,O | entity=S, role='S' | Filter on O | Partition scan + filter |
| 11 | ?,?,P,O | entity=O, role='O', p=P | — | **Perfect prefix** |
| 12 | D,?,?,? | entity=D, role='G' | — | **Perfect prefix** |
| 13 | ?,S,?,? | entity=S, role='S' | — | **Perfect prefix** |
| 14 | ?,?,P,? | entity=P, role='P' | — | **Perfect prefix** |
| 15 | ?,?,?,O | entity=O, role='O' | — | **Perfect prefix** |
| 16 | ?,?,?,? | — | Full scan | Exploration only |
**Key result**: 7 of the 15 non-trivial patterns are perfect clustering prefix hits. The remaining 8 are single-partition reads with in-partition filtering. Every query with at least one known element hits a partition key.
Pattern 16 (?,?,?,?) does not occur in practice since collection is always specified, reducing it to pattern 12.
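The routing rule behind this table is compact: use the subject's partition when S is known, otherwise the object's, otherwise the predicate's, otherwise the dataset's (role `G`). A sketch in Python, with illustrative names:
```python
def plan_lookup(d=None, s=None, p=None, o=None):
    """Choose the (entity, role) partition and clustering prefix for a
    DSPO pattern; anything not covered by the prefix is filtered
    in-partition."""
    if s is not None:
        entity, role = s, "S"
    elif o is not None:
        entity, role = o, "O"
    elif p is not None:
        entity, role = p, "P"
    elif d is not None:
        entity, role = d, "G"
    else:
        raise ValueError("at least one of D, S, P, O must be known")

    prefix = {"role": role}
    # p is the next clustering column; adding it only helps when we are
    # not already in the predicate's own partition.
    if p is not None and role in ("S", "O"):
        prefix["p"] = p
    return entity, prefix
```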
### Common Query Examples
**Everything about an entity:**
```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice';
```
**All outgoing relationships for an entity:**
```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
AND role = 'S';
```
**Specific predicate for an entity:**
```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
AND role = 'S' AND p = 'https://example.org/knows';
```
**Label for an entity (specific language):**
```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
AND role = 'S' AND p = 'http://www.w3.org/2000/01/rdf-schema#label'
AND otype = 'L';
```
Then filter by `lang = 'en'` application-side if needed.
**Only URI-valued relationships (entity-to-entity links):**
```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
AND role = 'S' AND p = 'https://example.org/knows' AND otype = 'U';
```
**Reverse lookup — what points to this entity:**
```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Bob'
AND role = 'O';
```
## Label Resolution and Cache Warming
One of the most significant advantages of the entity-centric model is that **label resolution becomes a free side effect**.
In the traditional multi-table model, fetching labels requires separate round-trip queries: retrieve triples, identify entity URIs in the results, then fetch `rdfs:label` for each. This N+1 pattern is expensive.
In the entity-centric model, querying an entity returns **all** its quads — including its labels, types, and other properties. When the application caches query results, labels are pre-warmed before anything asks for them.
Two usage regimes confirm this works well in practice:
- **Human-facing queries**: naturally small result sets, labels essential. Entity reads pre-warm the cache.
- **AI/bulk queries**: large result sets with hard limits. Labels either unnecessary or needed only for a curated subset of entities already in cache.
The theoretical concern of resolving labels for huge result sets (e.g. 30,000 entities) is mitigated by the practical observation that no human or AI consumer usefully processes that many labels. Application-level query limits ensure cache pressure remains manageable.
## Wide Partitions and Reification
Reification (RDF-star style statements about statements) creates hub entities — e.g. a source document that supports thousands of extracted facts. This can produce wide partitions.
Mitigating factors:
- **Application-level query limits**: all GraphRAG and human-facing queries enforce hard limits, so wide partitions are never fully scanned on the hot read path
- **Cassandra handles partial reads efficiently**: a clustering column scan with an early stop is fast even on large partitions
- **Collection deletion** (the only operation that might traverse full partitions) is an accepted background process
## Collection Deletion
Triggered by API call, runs in the background (eventually consistent).
1. Read `quads_by_collection` for the target collection to get all quads
2. Extract unique entities from the quads (s, p, o, d values)
3. For each unique entity, delete the partition from `quads_by_entity`
4. Delete the rows from `quads_by_collection`
The `quads_by_collection` table provides the index needed to locate all entity partitions without a full table scan. Partition-level deletes are efficient since `(collection, entity)` is the partition key.
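A sketch of the background job, assuming the Python cassandra-driver (the production worker may batch, rate-limit, and checkpoint differently):
```python
def delete_collection(session, collection):
    """Remove every quad in a collection from both tables.
    `session` is a connected cassandra-driver Session."""
    quads = session.execute(
        "SELECT d, s, p, o, otype FROM quads_by_collection WHERE collection = %s",
        (collection,))

    # Gather every entity that owns a partition for this collection.
    entities = set()
    for q in quads:
        entities.update([q.d, q.s, q.p])
        if q.otype != "L":  # literal objects have no entity partition
            entities.add(q.o)

    # Partition-level deletes: (collection, entity) is the partition key.
    for entity in entities:
        session.execute(
            "DELETE FROM quads_by_entity WHERE collection = %s AND entity = %s",
            (collection, entity))

    # Finally drop the manifest partition itself.
    session.execute(
        "DELETE FROM quads_by_collection WHERE collection = %s",
        (collection,))
```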
## Migration Path from Multi-Table Model
The entity-centric model can coexist with the existing multi-table model during migration:
1. Deploy `quads_by_entity` and `quads_by_collection` tables alongside existing tables
2. Dual-write new quads to both old and new tables
3. Backfill existing data into the new tables
4. Migrate read paths one query pattern at a time
5. Decommission old tables once all reads are migrated
## Summary
| Aspect | Traditional (multi-table) | Entity-centric (2-table) |
|---|---|---|
| Tables | 7+ | 2 |
| Writes per quad | 7+ (one per table) | 5 (4 data + 1 manifest) |
| Label resolution | Separate round trips | Free via cache warming |
| Query patterns | 16 across 6+ tables | 16 on 1 table |
| Schema complexity | High | Low |
| Operational overhead | 7+ tables to tune/repair | 1 data table |
| Reification support | Additional complexity | Natural fit |
| Object type filtering | Not available | Native (via otype clustering) |


@ -0,0 +1,573 @@
# Graph Contexts Technical Specification
## Overview
This specification describes changes to TrustGraph's core graph primitives to
align with RDF 1.2 and support full RDF Dataset semantics. This is a breaking
change for the 2.x release series.
### Versioning
- **2.0**: Early adopter release. Core features available, may not be fully
production-ready.
- **2.1 / 2.2**: Production release. Stability and completeness validated.
Flexibility on maturity is intentional - early adopters can access new
capabilities before all features are production-hardened.
## Goals
The primary goals for this work are to enable metadata about facts/statements:
- **Temporal information**: Associate facts with time metadata
- When a fact was believed to be true
- When a fact became true
- When a fact was discovered to be false
- **Provenance/Sources**: Track which sources support a fact
- "This fact was supported by source X"
- Link facts back to their origin documents
- **Veracity/Trust**: Record assertions about truth
- "Person P asserted this was true"
- "Person Q claims this is false"
- Enable trust scoring and conflict detection
**Hypothesis**: Reification (RDF-star / quoted triples) is the key mechanism
to achieve these outcomes, as all require making statements about statements.
## Background
To express "the fact (Alice knows Bob) was discovered on 2024-01-15" or
"source X supports the claim (Y causes Z)", you need to reference an edge
as a thing you can make statements about. Standard triples don't support this.
### Current Limitations
The current `Value` class in `trustgraph-base/trustgraph/schema/core/primitives.py`
can represent:
- URI nodes (`is_uri=True`)
- Literal values (`is_uri=False`)
The `type` field exists but is not used to represent XSD datatypes.
## Technical Design
### RDF Features to Support
#### Core Features (Related to Reification Goals)
These features are directly related to the temporal, provenance, and veracity
goals:
1. **RDF 1.2 Quoted Triples (RDF-star)**
- Edges that point at other edges
- A Triple can appear as the subject or object of another Triple
- Enables statements about statements (reification)
- Core mechanism for annotating individual facts
2. **RDF Dataset / Named Graphs**
- Support for multiple named graphs within a dataset
- Each graph identified by an IRI
- Moves from triples (s, p, o) to quads (s, p, o, g)
- Includes a default graph plus zero or more named graphs
- The graph IRI can be a subject in statements, e.g.:
```
<graph-source-A> <discoveredOn> "2024-01-15"
<graph-source-A> <hasVeracity> "high"
```
- Note: Named graphs are a separate feature from reification. They have
uses beyond statement annotation (partitioning, access control, dataset
organization) and should be treated as a distinct capability.
3. **Blank Nodes** (Limited Support)
- Anonymous nodes without a global URI
- Supported for compatibility when loading external RDF data
- **Limited status**: No guarantees about stable identity after loading
- Find them via wildcard queries (match by connections, not by ID)
- Not a first-class feature - don't rely on precise blank node handling
#### Opportunistic Fixes (2.0 Breaking Change)
These features are not directly related to the reification goals but are
valuable improvements to include while making breaking changes:
4. **Literal Datatypes**
- Properly use the `type` field for XSD datatypes
- Examples: xsd:string, xsd:integer, xsd:dateTime, etc.
- Fixes current limitation: cannot represent dates or integers properly
5. **Language Tags**
- Support for language attributes on string literals (@en, @fr, etc.)
- Note: A literal has either a language tag OR a datatype, not both
(except for rdf:langString)
- Important for AI/multilingual use cases
### Data Models
#### Term (rename from Value)
The `Value` class will be renamed to `Term` to better reflect RDF terminology.
This rename serves two purposes:
1. Aligns naming with RDF concepts (a "Term" can be an IRI, literal, blank
node, or quoted triple - not just a "value")
2. Forces code review at the breaking change interface - any code still
referencing `Value` is visibly broken and needs updating
A Term can represent:
- **IRI/URI** - A named node/resource
- **Blank Node** - An anonymous node with local scope
- **Literal** - A data value with either:
- A datatype (XSD type), OR
- A language tag
- **Quoted Triple** - A triple used as a term (RDF 1.2)
##### Chosen Approach: Single Class with Type Discriminator
Serialization requirements drive the structure - a type discriminator is needed
in the wire format regardless of the Python representation. A single class with
a type field is the natural fit and aligns with the current `Value` pattern.
Single-character type codes provide compact serialization:
```python
from dataclasses import dataclass
# Term type constants
IRI = "i" # IRI/URI node
BLANK = "b" # Blank node
LITERAL = "l" # Literal value
TRIPLE = "t" # Quoted triple (RDF-star)
@dataclass
class Term:
type: str = "" # One of: IRI, BLANK, LITERAL, TRIPLE
# For IRI terms (type == IRI)
iri: str = ""
# For blank nodes (type == BLANK)
id: str = ""
# For literals (type == LITERAL)
value: str = ""
datatype: str = "" # XSD datatype URI (mutually exclusive with language)
language: str = "" # Language tag (mutually exclusive with datatype)
# For quoted triples (type == TRIPLE)
triple: "Triple | None" = None
```
Usage examples:
```python
# IRI term
node = Term(type=IRI, iri="http://example.org/Alice")
# Literal with datatype
age = Term(type=LITERAL, value="42", datatype="xsd:integer")
# Literal with language tag
label = Term(type=LITERAL, value="Hello", language="en")
# Blank node
anon = Term(type=BLANK, id="_:b1")
# Quoted triple (statement about a statement)
inner = Triple(
s=Term(type=IRI, iri="http://example.org/Alice"),
p=Term(type=IRI, iri="http://example.org/knows"),
o=Term(type=IRI, iri="http://example.org/Bob"),
)
reified = Term(type=TRIPLE, triple=inner)
```
##### Alternatives Considered
**Option B: Union of specialized classes** (`Term = IRI | BlankNode | Literal | QuotedTriple`)
- Rejected: Serialization would still need a type discriminator, adding complexity
**Option C: Base class with subclasses**
- Rejected: Same serialization issue, plus dataclass inheritance quirks
#### Triple / Quad
The `Triple` class gains an optional graph field to become a quad:
```python
@dataclass
class Triple:
s: Term | None = None # Subject
p: Term | None = None # Predicate
o: Term | None = None # Object
g: str | None = None # Graph name (IRI), None = default graph
```
Design decisions:
- **Field name**: `g` for consistency with `s`, `p`, `o`
- **Optional**: `None` means the default graph (unnamed)
- **Type**: Plain string (IRI) rather than Term
- Graph names are always IRIs
- Blank nodes as graph names ruled out (too confusing)
- No need for the full Term machinery
Note: The class name stays `Triple` even though it's technically a quad now.
This avoids churn and "triple" is still the common terminology for the s/p/o
portion. The graph context is metadata about where the triple lives.
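For illustration, constructing a quad and an annotation on it with the structures above (example IRIs only):
```python
# "Alice knows Bob", asserted in a named graph
quad = Triple(
    s=Term(type=IRI, iri="http://example.org/Alice"),
    p=Term(type=IRI, iri="http://example.org/knows"),
    o=Term(type=IRI, iri="http://example.org/Bob"),
    g="http://example.org/graph1",
)

# The same statement in the default graph: omit g (defaults to None)
fact = Triple(s=quad.s, p=quad.p, o=quad.o)

# Provenance metadata about the statement itself (RDF-star style)
annotation = Triple(
    s=Term(type=TRIPLE, triple=fact),
    p=Term(type=IRI, iri="http://example.org/discoveredOn"),
    o=Term(type=LITERAL, value="2024-01-15", datatype="xsd:date"),
)
```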
### Candidate Query Patterns
The current query engine accepts combinations of S, P, O terms. With quoted
triples, a triple itself becomes a valid term in those positions. Below are
candidate query patterns that support the original goals.
#### Graph Parameter Semantics
Following SPARQL conventions for backward compatibility:
- **`g` omitted / None**: Query the default graph only
- **`g` = specific IRI**: Query that named graph only
- **`g` = wildcard / `*`**: Query across all graphs (equivalent to SPARQL
`GRAPH ?g { ... }`)
This keeps simple queries simple and makes named graph queries opt-in.
Cross-graph queries (g=wildcard) are fully supported. The Cassandra schema
includes dedicated tables (SPOG, POSG, OSPG) where g is a clustering column
rather than a partition key, enabling efficient queries across all graphs.
#### Temporal Queries
**Find all facts discovered after a given date:**
```
S: ? # any quoted triple
P: <discoveredOn>
O: > "2024-01-15"^^xsd:date # date comparison
```
**Find when a specific fact was believed true:**
```
S: << <Alice> <knows> <Bob> >> # quoted triple as subject
P: <believedTrueFrom>
O: ? # returns the date
```
**Find facts that became false:**
```
S: ? # any quoted triple
P: <discoveredFalseOn>
O: ? # has any value (exists)
```
#### Provenance Queries
**Find all facts supported by a specific source:**
```
S: ? # any quoted triple
P: <supportedBy>
O: <source:document-123>
```
**Find which sources support a specific fact:**
```
S: << <DrugA> <treats> <DiseaseB> >> # quoted triple as subject
P: <supportedBy>
O: ? # returns source IRIs
```
#### Veracity Queries
**Find assertions a person marked as true:**
```
S: ? # any quoted triple
P: <assertedTrueBy>
O: <person:Alice>
```
**Find conflicting assertions (same fact, different veracity):**
```
# First query: facts asserted true
S: ?
P: <assertedTrueBy>
O: ?
# Second query: facts asserted false
S: ?
P: <assertedFalseBy>
O: ?
# Application logic: find intersection of subjects
```
**Find facts with trust score below threshold:**
```
S: ? # any quoted triple
P: <trustScore>
O: < 0.5 # numeric comparison
```
### Architecture
Significant changes required across multiple components:
#### This Repository (trustgraph)
- **Schema primitives** (`trustgraph-base/trustgraph/schema/core/primitives.py`)
- Value → Term rename
- New Term structure with type discriminator
- Triple gains `g` field for graph context
- **Message translators** (`trustgraph-base/trustgraph/messaging/translators/`)
- Update for new Term/Triple structures
- Serialization/deserialization for new fields
- **Gateway components**
- Handle new Term and quad structures
- **Knowledge cores**
- Core changes to support quads and reification
- **Knowledge manager**
- Schema changes propagate here
- **Storage layers**
- Cassandra: Schema redesign (see Implementation Details)
- Other backends: Deferred to later phases
- **Command-line utilities**
- Update for new data structures
- **REST API documentation**
- OpenAPI spec updates
#### External Repositories
- **Python API** (this repo)
- Client library updates for new structures
- **TypeScript APIs** (separate repo)
- Client library updates
- **Workbench** (separate repo)
- Significant state management changes
### APIs
#### REST API
- Documented in OpenAPI spec
- Will need updates for new Term/Triple structures
- New endpoints may be needed for graph context operations
#### Python API (this repo)
- Client library changes to match new primitives
- Breaking changes to Term (was Value) and Triple
#### TypeScript API (separate repo)
- Parallel changes to Python API
- Separate release coordination
#### Workbench (separate repo)
- Significant state management changes
- UI updates for graph context features
### Implementation Details
#### Phased Storage Implementation
Multiple graph store backends exist (Cassandra, Neo4j, etc.). Implementation
will proceed in phases:
1. **Phase 1: Cassandra**
- Start with the home-grown Cassandra store
- Full control over the storage layer enables rapid iteration
- Schema will be redesigned from scratch for quads + reification
- Validate the data model and query patterns against real use cases
#### Cassandra Schema Design
Cassandra requires multiple tables to support different query access patterns
(each table efficiently queries by its partition key + clustering columns).
##### Query Patterns
With quads (g, s, p, o), each position can be specified or wildcard, giving
16 possible query patterns:
| # | g | s | p | o | Description |
|---|---|---|---|---|-------------|
| 1 | ? | ? | ? | ? | All quads |
| 2 | ? | ? | ? | o | By object |
| 3 | ? | ? | p | ? | By predicate |
| 4 | ? | ? | p | o | By predicate + object |
| 5 | ? | s | ? | ? | By subject |
| 6 | ? | s | ? | o | By subject + object |
| 7 | ? | s | p | ? | By subject + predicate |
| 8 | ? | s | p | o | Full triple (which graphs?) |
| 9 | g | ? | ? | ? | By graph |
| 10 | g | ? | ? | o | By graph + object |
| 11 | g | ? | p | ? | By graph + predicate |
| 12 | g | ? | p | o | By graph + predicate + object |
| 13 | g | s | ? | ? | By graph + subject |
| 14 | g | s | ? | o | By graph + subject + object |
| 15 | g | s | p | ? | By graph + subject + predicate |
| 16 | g | s | p | o | Exact quad |
##### Table Design
Cassandra constraint: You can only efficiently query by partition key, then
filter on clustering columns left-to-right. For g-wildcard queries, g must be
a clustering column. For g-specified queries, g in the partition key is more
efficient.
**Two table families needed:**
**Family A: g-wildcard queries** (g in clustering columns)
| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| SPOG | (user, collection, s) | p, o, g | 5, 7, 8 |
| POSG | (user, collection, p) | o, s, g | 3, 4 |
| OSPG | (user, collection, o) | s, p, g | 2, 6 |
**Family B: g-specified queries** (g in partition key)
| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| GSPO | (user, collection, g, s) | p, o | 13, 15, 16 |
| GPOS | (user, collection, g, p) | o, s | 11, 12 |
| GOSP | (user, collection, g, o) | s, p | 10, 14 |
**Collection table** (for iteration and bulk deletion)
| Table | Partition | Clustering | Purpose |
|-------|-----------|------------|---------|
| COLL | (user, collection) | g, s, p, o | Enumerate all quads in collection; also serves patterns 1 and 9, since g is its first clustering column |
##### Write and Delete Paths
**Write path**: Insert into all 7 tables.
**Delete collection path**:
1. Iterate COLL table for `(user, collection)`
2. For each quad, delete from all 6 query tables
3. Delete from COLL table (or range delete)
**Delete single quad path**: Delete from all 7 tables directly.
##### Storage Cost
Each quad is stored 7 times. This is the cost of flexible querying combined
with efficient collection deletion.
##### Quoted Triples in Storage
Subject or object can be a triple itself. Options:
**Option A: Serialize quoted triples to canonical string**
```
S: "<<http://ex/Alice|http://ex/knows|http://ex/Bob>>"
P: http://ex/discoveredOn
O: "2024-01-15"
G: null
```
- Store quoted triple as serialized string in S or O columns
- Query by exact match on serialized form
- Pro: Simple, fits existing index patterns
- Con: Can't query "find triples where quoted subject's predicate is X"
**Option B: Triple IDs / Hashes**
```
Triple table:
id: hash(s,p,o,g)
s, p, o, g: ...
Metadata table:
subject_triple_id: <hash>
p: http://ex/discoveredOn
o: "2024-01-15"
```
- Assign each triple an ID (hash of components)
- Reification metadata references triples by ID
- Pro: Clean separation, can index triple IDs
- Con: Requires computing/managing triple identity, two-phase lookups
**Recommendation**: Start with Option A (serialized strings) for simplicity.
Option B may be needed if advanced query patterns over quoted triple
components are required.
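A minimal sketch of Option A's canonical form, assuming the `<<s|p|o>>` shape shown above and that `|` does not appear unescaped in stored terms (helper names are illustrative):
```python
def serialize_term(term):
    """Render a term for storage in the S or O columns."""
    if term.type == TRIPLE:
        return serialize_quoted(term.triple)
    if term.type == IRI:
        return term.iri
    return term.value  # literal value; datatype/language live in other columns

def serialize_quoted(triple):
    """Canonical string for a quoted triple, e.g.
    <<http://ex/Alice|http://ex/knows|http://ex/Bob>>"""
    return "<<{}|{}|{}>>".format(
        serialize_term(triple.s),
        serialize_term(triple.p),
        serialize_term(triple.o),
    )
```
Because the form is deterministic, the same quoted triple always serializes to the same string, so exact-match lookups on the serialized value behave as described.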
2. **Phase 2+: Other Backends**
- Neo4j and other stores implemented in subsequent stages
- Lessons learned from Cassandra inform these implementations
This approach de-risks the design by validating on a fully-controlled backend
before committing to implementations across all stores.
#### Value → Term Rename
The `Value` class will be renamed to `Term`. This affects ~78 files across
the codebase. The rename acts as a forcing function: any code still using
`Value` is immediately identifiable as needing review/update for 2.0
compatibility.
## Security Considerations
Named graphs are not a security feature. Users and collections remain the
security boundaries. Named graphs are purely for data organization and
reification support.
## Performance Considerations
- Quoted triples add nesting depth - may impact query performance
- Named graph indexing strategies needed for efficient graph-scoped queries
- Cassandra schema design will need to accommodate quad storage efficiently
### Vector Store Boundary
Vector stores always reference IRIs only:
- Never edges (quoted triples)
- Never literal values
- Never blank nodes
This keeps the vector store simple - it handles semantic similarity of named
entities. The graph structure handles relationships, reification, and metadata.
Quoted triples and named graphs don't complicate vector operations.
## Testing Strategy
Use the existing test strategy. As this is a breaking change, place extensive
focus on the end-to-end test suite to validate that the new structures work
correctly across all components.
## Migration Plan
- 2.0 is a breaking release; no backward compatibility required
- Existing data may need migration to new schema (TBD based on final design)
- Consider migration tooling for converting existing triples
## Open Questions
- **Blank nodes**: Limited support confirmed. May need to decide on
skolemization strategy (generate IRIs on load, or preserve blank node IDs).
- **Query syntax**: What is the concrete syntax for specifying quoted triples
in queries? Need to define the query API.
- ~~**Predicate vocabulary**~~: Resolved. Any valid RDF predicates permitted,
including custom user-defined. Minimal assumptions about RDF validity.
Very few locked-in values (e.g., `rdfs:label` used in some places).
Strategy: avoid locking anything in unless absolutely necessary.
- ~~**Vector store impact**~~: Resolved. Vector stores always point to IRIs
only - never edges, literals, or blank nodes. Quoted triples and
reification don't affect the vector store.
- ~~**Named graph semantics**~~: Resolved. Queries default to the default
graph (matches SPARQL behavior, backward compatible). Explicit graph
parameter required to query named graphs or all graphs.
## References
- [RDF 1.2 Concepts](https://www.w3.org/TR/rdf12-concepts/)
- [RDF-star and SPARQL-star](https://w3c.github.io/rdf-star/)
- [RDF Dataset](https://www.w3.org/TR/rdf11-concepts/#section-dataset)


@ -0,0 +1,455 @@
# JSONL Prompt Output Technical Specification
## Overview
This specification describes the implementation of JSONL (JSON Lines) output
format for prompt responses in TrustGraph. JSONL enables truncation-resilient
extraction of structured data from LLM responses, addressing critical issues
with JSON array outputs being corrupted when LLM responses hit output token
limits.
This implementation supports the following use cases:
1. **Truncation-Resilient Extraction**: Extract valid partial results even when
LLM output is truncated mid-response
2. **Large-Scale Extraction**: Handle extraction of many items without risk of
complete failure due to token limits
3. **Mixed-Type Extraction**: Support extraction of multiple entity types
(definitions, relationships, entities, attributes) in a single prompt
4. **Streaming-Compatible Output**: Enable future streaming/incremental
processing of extraction results
## Goals
- **Backward Compatibility**: Existing prompts using `response-type: "text"` and
`response-type: "json"` continue to work without modification
- **Truncation Resilience**: Partial LLM outputs yield partial valid results
rather than complete failure
- **Schema Validation**: Support JSON Schema validation for individual objects
- **Discriminated Unions**: Support mixed-type outputs using a `type` field
discriminator
- **Minimal API Changes**: Extend existing prompt configuration with new
response type and schema key
## Background
### Current Architecture
The prompt service supports two response types:
1. `response-type: "text"` - Raw text response returned as-is
2. `response-type: "json"` - JSON parsed from response, validated against
optional `schema`
Current implementation in `trustgraph-flow/trustgraph/template/prompt_manager.py`:
```python
class Prompt:
def __init__(self, template, response_type = "text", terms=None, schema=None):
self.template = template
self.response_type = response_type
self.terms = terms
self.schema = schema
```
### Current Limitations
When extraction prompts request output as JSON arrays (`[{...}, {...}, ...]`):
- **Truncation corruption**: If the LLM hits output token limits mid-array, the
entire response becomes invalid JSON and cannot be parsed
- **All-or-nothing parsing**: Must receive complete output before parsing
- **No partial results**: A truncated response yields zero usable data
- **Unreliable for large extractions**: More extracted items = higher failure risk
This specification addresses these limitations by introducing JSONL format for
extraction prompts, where each extracted item is a complete JSON object on its
own line.
## Technical Design
### Response Type Extension
Add a new response type `"jsonl"` alongside existing `"text"` and `"json"` types.
#### Configuration Changes
**New response type value:**
```
"response-type": "jsonl"
```
**Schema interpretation:**
The existing `"schema"` key is used for both `"json"` and `"jsonl"` response
types. The interpretation depends on the response type:
- `"json"`: Schema describes the entire response (typically an array or object)
- `"jsonl"`: Schema describes each individual line/object
```json
{
"response-type": "jsonl",
"schema": {
"type": "object",
"properties": {
"entity": { "type": "string" },
"definition": { "type": "string" }
},
"required": ["entity", "definition"]
  }
}
```
This avoids changes to prompt configuration tooling and editors.
### JSONL Format Specification
#### Simple Extraction
For prompts extracting a single type of object (definitions, relationships,
topics, rows), the output is one JSON object per line with no wrapper:
**Prompt output format:**
```
{"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"}
{"entity": "chlorophyll", "definition": "Green pigment in plants"}
{"entity": "mitochondria", "definition": "Powerhouse of the cell"}
```
**Contrast with previous JSON array format:**
```json
[
{"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"},
{"entity": "chlorophyll", "definition": "Green pigment in plants"},
{"entity": "mitochondria", "definition": "Powerhouse of the cell"}
]
```
If the LLM truncates after line 2, the JSON array format yields invalid JSON,
while JSONL yields two valid objects.
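A small standalone illustration of the difference (not part of the prompt service):
```python
import json

# Simulated LLM output cut off mid-way through the third item
truncated_jsonl = (
    '{"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"}\n'
    '{"entity": "chlorophyll", "definition": "Green pigment in plants"}\n'
    '{"entity": "mitoch'
)
truncated_array = "[\n" + truncated_jsonl.replace("}\n", "},\n")

# JSON array: the whole response is unusable
try:
    json.loads(truncated_array)
except json.JSONDecodeError:
    print("array parse failed, zero results")

# JSONL: every complete line still parses; only the truncated line is lost
results = []
for line in truncated_jsonl.splitlines():
    try:
        results.append(json.loads(line))
    except json.JSONDecodeError:
        pass
print(len(results))  # 2
```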
#### Mixed-Type Extraction (Discriminated Unions)
For prompts extracting multiple types of objects (e.g., both definitions and
relationships, or entities, relationships, and attributes), use a `"type"`
field as discriminator:
**Prompt output format:**
```
{"type": "definition", "entity": "DNA", "definition": "Molecule carrying genetic instructions"}
{"type": "relationship", "subject": "DNA", "predicate": "located_in", "object": "cell nucleus", "object-entity": true}
{"type": "definition", "entity": "RNA", "definition": "Molecule that carries genetic information"}
{"type": "relationship", "subject": "RNA", "predicate": "transcribed_from", "object": "DNA", "object-entity": true}
```
**Schema for discriminated unions uses `oneOf`:**
```json
{
"response-type": "jsonl",
"schema": {
"oneOf": [
{
"type": "object",
"properties": {
"type": { "const": "definition" },
"entity": { "type": "string" },
"definition": { "type": "string" }
},
"required": ["type", "entity", "definition"]
},
{
"type": "object",
"properties": {
"type": { "const": "relationship" },
"subject": { "type": "string" },
"predicate": { "type": "string" },
"object": { "type": "string" },
"object-entity": { "type": "boolean" }
},
"required": ["type", "subject", "predicate", "object", "object-entity"]
}
]
}
}
```
#### Ontology Extraction
For ontology-based extraction with entities, relationships, and attributes:
**Prompt output format:**
```
{"type": "entity", "entity": "Cornish pasty", "entity_type": "fo/Recipe"}
{"type": "entity", "entity": "beef", "entity_type": "fo/Food"}
{"type": "relationship", "subject": "Cornish pasty", "subject_type": "fo/Recipe", "relation": "fo/has_ingredient", "object": "beef", "object_type": "fo/Food"}
{"type": "attribute", "entity": "Cornish pasty", "entity_type": "fo/Recipe", "attribute": "fo/serves", "value": "4 people"}
```
### Implementation Details
#### Prompt Class
The existing `Prompt` class requires no changes. The `schema` field is reused
for JSONL, with its interpretation determined by `response_type`:
```python
class Prompt:
def __init__(self, template, response_type="text", terms=None, schema=None):
self.template = template
self.response_type = response_type
self.terms = terms
self.schema = schema # Interpretation depends on response_type
```
#### PromptManager.load_config
No changes required - existing configuration loading already handles the
`schema` key.
#### JSONL Parsing
Add a new parsing method for JSONL responses:
```python
def parse_jsonl(self, text):
"""
Parse JSONL response, returning list of valid objects.
Invalid lines (malformed JSON, empty lines) are skipped with warnings.
This provides truncation resilience - partial output yields partial results.
"""
results = []
for line_num, line in enumerate(text.strip().split('\n'), 1):
line = line.strip()
# Skip empty lines
if not line:
continue
# Skip markdown code fence markers if present
if line.startswith('```'):
continue
try:
obj = json.loads(line)
results.append(obj)
except json.JSONDecodeError as e:
# Log warning but continue - this provides truncation resilience
logger.warning(f"JSONL parse error on line {line_num}: {e}")
return results
```
#### PromptManager.invoke Changes
Extend the invoke method to handle the new response type:
```python
async def invoke(self, id, input, llm):
logger.debug("Invoking prompt template...")
terms = self.terms | self.prompts[id].terms | input
resp_type = self.prompts[id].response_type
prompt = {
"system": self.system_template.render(terms),
"prompt": self.render(id, input)
}
resp = await llm(**prompt)
if resp_type == "text":
return resp
if resp_type == "json":
try:
obj = self.parse_json(resp)
except:
logger.error(f"JSON parse failed: {resp}")
raise RuntimeError("JSON parse fail")
if self.prompts[id].schema:
try:
validate(instance=obj, schema=self.prompts[id].schema)
logger.debug("Schema validation successful")
except Exception as e:
raise RuntimeError(f"Schema validation fail: {e}")
return obj
if resp_type == "jsonl":
objects = self.parse_jsonl(resp)
if not objects:
logger.warning("JSONL parse returned no valid objects")
return []
# Validate each object against schema if provided
if self.prompts[id].schema:
validated = []
for i, obj in enumerate(objects):
try:
validate(instance=obj, schema=self.prompts[id].schema)
validated.append(obj)
except Exception as e:
logger.warning(f"Object {i} failed schema validation: {e}")
return validated
return objects
raise RuntimeError(f"Response type {resp_type} not known")
```
### Affected Prompts
The following prompts should be migrated to JSONL format:
| Prompt ID | Description | Type Field |
|-----------|-------------|------------|
| `extract-definitions` | Entity/definition extraction | No (single type) |
| `extract-relationships` | Relationship extraction | No (single type) |
| `extract-topics` | Topic/definition extraction | No (single type) |
| `extract-rows` | Structured row extraction | No (single type) |
| `agent-kg-extract` | Combined definition + relationship extraction | Yes: `"definition"`, `"relationship"` |
| `extract-with-ontologies` / `ontology-extract` | Ontology-based extraction | Yes: `"entity"`, `"relationship"`, `"attribute"` |
### API Changes
#### Client Perspective
JSONL parsing is transparent to prompt service API callers. The parsing occurs
server-side in the prompt service, and the response is returned via the standard
`PromptResponse.object` field as a serialized JSON array.
When clients call the prompt service (via `PromptClient.prompt()` or similar):
- **`response-type: "json"`** with array schema → client receives Python `list`
- **`response-type: "jsonl"`** → client receives Python `list`
From the client's perspective, both return identical data structures. The
difference is entirely in how the LLM output is parsed server-side:
- JSON array format: Single `json.loads()` call; fails completely if truncated
- JSONL format: Line-by-line parsing; yields partial results if truncated
This means existing client code expecting a list from extraction prompts
requires no changes when migrating prompts from JSON to JSONL format.
#### Server Return Value
For `response-type: "jsonl"`, the `PromptManager.invoke()` method returns a
`list[dict]` containing all successfully parsed and validated objects. This
list is then serialized to JSON for the `PromptResponse.object` field.
#### Error Handling
- Empty results: Returns empty list `[]` with warning log
- Partial parse failure: Returns list of successfully parsed objects with
warning logs for failures
- Complete parse failure: Returns empty list `[]` with warning logs
This differs from `response-type: "json"` which raises `RuntimeError` on
parse failure. The lenient behavior for JSONL is intentional to provide
truncation resilience.
### Configuration Example
Complete prompt configuration example:
```json
{
"prompt": "Extract all entities and their definitions from the following text. Output one JSON object per line.\n\nText:\n{{text}}\n\nOutput format per line:\n{\"entity\": \"<name>\", \"definition\": \"<definition>\"}",
"response-type": "jsonl",
"schema": {
"type": "object",
"properties": {
"entity": {
"type": "string",
"description": "The entity name"
},
"definition": {
"type": "string",
"description": "A clear definition of the entity"
}
},
"required": ["entity", "definition"]
}
}
```
## Security Considerations
- **Input Validation**: JSON parsing uses standard `json.loads()` which is safe
against injection attacks
- **Schema Validation**: Uses `jsonschema.validate()` for schema enforcement
- **No New Attack Surface**: JSONL parsing is strictly safer than JSON array
parsing due to line-by-line processing
## Performance Considerations
- **Memory**: Line-by-line parsing uses less peak memory than loading full JSON
arrays
- **Latency**: Parsing performance is comparable to JSON array parsing
- **Validation**: Schema validation runs per-object, which adds overhead but
enables partial results on validation failure
## Testing Strategy
### Unit Tests
- JSONL parsing with valid input
- JSONL parsing with empty lines
- JSONL parsing with markdown code fences
- JSONL parsing with truncated final line (see the sketch after this list)
- JSONL parsing with invalid JSON lines interspersed
- Schema validation with `oneOf` discriminated unions
- Backward compatibility: existing `"text"` and `"json"` prompts unchanged
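A sketch of the truncation test, assuming pytest and a `prompt_manager` fixture exposing the `parse_jsonl` method shown earlier (fixture construction is illustrative):
```python
def test_parse_jsonl_truncated_final_line(prompt_manager):
    # Two complete objects followed by a line cut off by the token limit
    response = (
        '{"entity": "DNA", "definition": "Molecule carrying genetic instructions"}\n'
        '{"entity": "RNA", "definition": "Molecule that carries genetic information"}\n'
        '{"entity": "ribosome", "defini'
    )
    results = prompt_manager.parse_jsonl(response)
    # The truncated line is dropped; the complete lines survive
    assert results == [
        {"entity": "DNA", "definition": "Molecule carrying genetic instructions"},
        {"entity": "RNA", "definition": "Molecule that carries genetic information"},
    ]
```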
### Integration Tests
- End-to-end extraction with JSONL prompts
- Extraction with simulated truncation (artificially limited response)
- Mixed-type extraction with type discriminator
- Ontology extraction with all three types
### Extraction Quality Tests
- Compare extraction results: JSONL vs JSON array format
- Verify truncation resilience: JSONL yields partial results where JSON fails
## Migration Plan
### Phase 1: Implementation
1. Implement `parse_jsonl()` method in `PromptManager`
2. Extend `invoke()` to handle `response-type: "jsonl"`
3. Add unit tests
### Phase 2: Prompt Migration
1. Update `extract-definitions` prompt and configuration
2. Update `extract-relationships` prompt and configuration
3. Update `extract-topics` prompt and configuration
4. Update `extract-rows` prompt and configuration
5. Update `agent-kg-extract` prompt and configuration
6. Update `extract-with-ontologies` prompt and configuration
### Phase 3: Downstream Updates
1. Update any code consuming extraction results to handle list return type
2. Update code that categorizes mixed-type extractions by `type` field
3. Update tests that assert on extraction output format
## Open Questions
None at this time.
## References
- Current implementation: `trustgraph-flow/trustgraph/template/prompt_manager.py`
- JSON Lines specification: https://jsonlines.org/
- JSON Schema `oneOf`: https://json-schema.org/understanding-json-schema/reference/combining.html#oneof
- Related specification: Streaming LLM Responses (`docs/tech-specs/streaming-llm-responses.md`)


@ -0,0 +1,613 @@
# Structured Data Technical Specification (Part 2)
## Overview
This specification addresses issues and gaps identified during the initial implementation of TrustGraph's structured data integration, as described in `structured-data.md`.
## Problem Statements
### 1. Naming Inconsistency: "Object" vs "Row"
The current implementation uses "object" terminology throughout (e.g., `ExtractedObject`, object extraction, object embeddings). This naming is too generic and causes confusion:
- "Object" is an overloaded term in software (Python objects, JSON objects, etc.)
- The data being handled is fundamentally tabular - rows in tables with defined schemas
- "Row" more accurately describes the data model and aligns with database terminology
This inconsistency appears in module names, class names, message types, and documentation.
### 2. Row Store Query Limitations
The current row store implementation has significant query limitations:
**Natural Language Mismatch**: Queries struggle with real-world data variations. For example:
- A street database containing `"CHESTNUT ST"` is difficult to find when asking about `"Chestnut Street"`
- Abbreviations, case differences, and formatting variations break exact-match queries
- Users expect semantic understanding, but the store provides literal matching
**Schema Evolution Issues**: Changing schemas causes problems:
- Existing data may not conform to updated schemas
- Table structure changes can break queries and data integrity
- No clear migration path for schema updates
### 3. Row Embeddings Required
Related to problem 2, the system needs vector embeddings for row data to enable:
- Semantic search across structured data (finding "Chestnut Street" when data contains "CHESTNUT ST")
- Similarity matching for fuzzy queries
- Hybrid search combining structured filters with semantic similarity
- Better natural language query support
The embedding service was specified but not implemented.
### 4. Row Data Ingestion Incomplete
The structured data ingestion pipeline is not fully operational:
- Diagnostic prompts exist to classify input formats (CSV, JSON, etc.)
- The ingestion service that uses these prompts is not plumbed into the system
- No end-to-end path for loading pre-structured data into the row store
## Goals
- **Schema Flexibility**: Enable schema evolution without breaking existing data or requiring migrations
- **Consistent Naming**: Standardize on "row" terminology throughout the codebase
- **Semantic Queryability**: Support fuzzy/semantic matching via row embeddings
- **Complete Ingestion Pipeline**: Provide end-to-end path for loading structured data
## Technical Design
### Unified Row Storage Schema
The previous implementation created a separate Cassandra table for each schema. This caused problems when schemas evolved, as table structure changes required migrations.
The new design uses a single unified table for all row data:
```sql
CREATE TABLE rows (
collection text,
schema_name text,
index_name text,
index_value frozen<list<text>>,
data map<text, text>,
source text,
PRIMARY KEY ((collection, schema_name, index_name), index_value)
)
```
#### Column Definitions
| Column | Type | Description |
|--------|------|-------------|
| `collection` | `text` | Data collection/import identifier (from metadata) |
| `schema_name` | `text` | Name of the schema this row conforms to |
| `index_name` | `text` | Name of the indexed field(s), comma-joined for composites |
| `index_value` | `frozen<list<text>>` | Index value(s) as a list |
| `data` | `map<text, text>` | Row data as key-value pairs |
| `source` | `text` | Optional URI linking to provenance information in the knowledge graph. Empty string or NULL indicates no source. |
#### Index Handling
Each row is stored multiple times - once per indexed field defined in the schema. The primary key fields are treated as an index with no special marker, providing future flexibility.
**Single-field index example:**
- Schema defines `email` as indexed
- `index_name = "email"`
- `index_value = ['foo@bar.com']`
**Composite index example:**
- Schema defines composite index on `region` and `status`
- `index_name = "region,status"` (field names sorted and comma-joined)
- `index_value = ['US', 'active']` (values in same order as field names)
**Primary key example:**
- Schema defines `customer_id` as primary key
- `index_name = "customer_id"`
- `index_value = ['CUST001']`
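A sketch of this fan-out in Python, assuming the schema config exposes each index as a list of field names (all names illustrative):
```python
def index_entries(schema_indexes, row):
    """Yield one (index_name, index_value) pair per index defined in the
    schema, for a single row of data.

    schema_indexes: e.g. [["email"], ["region", "status"], ["customer_id"]]
    row:            the row's data as a dict of field -> value
    """
    for fields in schema_indexes:
        ordered = sorted(fields)                # field names sorted...
        name = ",".join(ordered)                # ...and comma-joined
        value = [str(row[f]) for f in ordered]  # values in the same order
        yield name, value

# The writer issues one INSERT into the rows table per yielded pair,
# each carrying the same `data` map and `source` value.
```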
#### Query Patterns
All queries follow the same pattern regardless of which index is used:
```sql
SELECT * FROM rows
WHERE collection = 'import_2024'
AND schema_name = 'customers'
AND index_name = 'email'
AND index_value = ['foo@bar.com']
```
#### Design Trade-offs
**Advantages:**
- Schema changes don't require table structure changes
- Row data is opaque to Cassandra - field additions/removals are transparent
- Consistent query pattern for all access methods
- No Cassandra secondary indexes (which can be slow at scale)
- Native Cassandra types throughout (`map`, `frozen<list>`)
**Trade-offs:**
- Write amplification: each row insert = N inserts (one per indexed field)
- Storage overhead from duplicated row data
- Type information stored in schema config, conversion at application layer
#### Consistency Model
The design accepts certain simplifications:
1. **No row updates**: The system is append-only. This eliminates consistency concerns about updating multiple copies of the same row.
2. **Schema change tolerance**: When schemas change (e.g., indexes added/removed), existing rows retain their original indexing. Old rows won't be discoverable via new indexes. Users can delete and recreate a schema to ensure consistency if needed.
### Partition Tracking and Deletion
#### The Problem
With the partition key `(collection, schema_name, index_name)`, efficient deletion requires knowing all partition keys to delete. Deleting by just `collection` or `collection + schema_name` requires knowing all the `index_name` values that have data.
#### Partition Tracking Table
A secondary lookup table tracks which partitions exist:
```sql
CREATE TABLE row_partitions (
collection text,
schema_name text,
index_name text,
PRIMARY KEY ((collection), schema_name, index_name)
)
```
This enables efficient discovery of partitions for deletion operations.
#### Row Writer Behavior
The row writer maintains an in-memory cache of registered `(collection, schema_name)` pairs. When processing a row:
1. Check if `(collection, schema_name)` is in the cache
2. If not cached (first row for this pair):
- Look up the schema config to get all index names
- Insert entries into `row_partitions` for each `(collection, schema_name, index_name)`
- Add the pair to the cache
3. Proceed with writing the row data
The row writer also monitors schema config change events. When a schema changes, relevant cache entries are cleared so the next row triggers re-registration with the updated index names.
This approach ensures:
- Lookup table writes happen once per `(collection, schema_name)` pair, not per row
- The lookup table reflects the indexes that were active when data was written
- Schema changes mid-import are picked up correctly
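A sketch of the registration logic (class shape and config lookup are illustrative):
```python
class RowWriter:
    def __init__(self, session, schema_config):
        self.session = session              # cassandra-driver Session
        self.schema_config = schema_config  # yields index names per schema
        self.registered = set()             # cached (collection, schema_name) pairs

    def on_schema_change(self, schema_name):
        # Clear cache entries so the next row re-registers with new indexes
        self.registered = {k for k in self.registered if k[1] != schema_name}

    def write_row(self, collection, schema_name, row):
        key = (collection, schema_name)
        if key not in self.registered:
            for index_name in self.schema_config.index_names(schema_name):
                self.session.execute(
                    "INSERT INTO row_partitions (collection, schema_name, index_name) "
                    "VALUES (%s, %s, %s)",
                    (collection, schema_name, index_name))
            self.registered.add(key)
        # ... then write the row itself, once per index, into the rows table
```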
#### Deletion Operations
**Delete collection:**
```sql
-- 1. Discover all partitions
SELECT schema_name, index_name FROM row_partitions WHERE collection = 'X';
-- 2. Delete each partition from rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = '...' AND index_name = '...';
-- (repeat for each discovered partition)
-- 3. Clean up the lookup table
DELETE FROM row_partitions WHERE collection = 'X';
```
**Delete collection + schema:**
```sql
-- 1. Discover partitions for this schema
SELECT index_name FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
-- 2. Delete each partition from rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '...';
-- (repeat for each discovered partition)
-- 3. Clean up the lookup table entries
DELETE FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
```
### Row Embeddings
Row embeddings enable semantic/fuzzy matching on indexed values, solving the natural language mismatch problem (e.g., finding "CHESTNUT ST" when querying for "Chestnut Street").
#### Design Overview
Each indexed value is embedded and stored in a vector store (Qdrant). At query time, the query is embedded, similar vectors are found, and the associated metadata is used to look up the actual rows in Cassandra.
#### Qdrant Collection Structure
One Qdrant collection per `(user, collection, schema_name, dimension)` tuple:
- **Collection naming:** `rows_{user}_{collection}_{schema_name}_{dimension}`
- Names are sanitized (non-alphanumeric characters replaced with `_`, lowercased, numeric prefixes get `r_` prefix)
- **Rationale:** Enables clean deletion of a `(user, collection, schema_name)` instance by dropping matching Qdrant collections; dimension suffix allows different embedding models to coexist
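A sketch of the naming rule; whether sanitization applies per component or to the whole name is an assumption here:
```python
import re

def _sanitize(part: str) -> str:
    # Replace non-alphanumerics with "_", lowercase, guard numeric prefixes
    part = re.sub(r"[^a-zA-Z0-9]", "_", part).lower()
    if part and part[0].isdigit():
        part = "r_" + part
    return part

def qdrant_collection_name(user, collection, schema_name, dimension):
    return "rows_{}_{}_{}_{}".format(
        _sanitize(user), _sanitize(collection), _sanitize(schema_name), dimension)

# e.g. ("alice", "import-2024", "addresses", 384)
#   -> "rows_alice_import_2024_addresses_384"
```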
#### What Gets Embedded
The text representation of index values:
| Index Type | Example `index_value` | Text to Embed |
|------------|----------------------|---------------|
| Single-field | `['foo@bar.com']` | `"foo@bar.com"` |
| Composite | `['US', 'active']` | `"US active"` (space-joined) |
#### Point Structure
Each Qdrant point contains:
```json
{
"id": "<uuid>",
"vector": [0.1, 0.2, ...],
"payload": {
"index_name": "street_name",
"index_value": ["CHESTNUT ST"],
"text": "CHESTNUT ST"
}
}
```
| Payload Field | Description |
|---------------|-------------|
| `index_name` | The indexed field(s) this embedding represents |
| `index_value` | The original list of values (for Cassandra lookup) |
| `text` | The text that was embedded (for debugging/display) |
Note: `user`, `collection`, and `schema_name` are implicit from the Qdrant collection name.
#### Query Flow
1. User queries for "Chestnut Street" within user U, collection X, schema Y
2. Embed the query text
3. Determine Qdrant collection name(s) matching prefix `rows_U_X_Y_`
4. Search matching Qdrant collection(s) for nearest vectors
5. Get matching points with payloads containing `index_name` and `index_value`
6. Query Cassandra:
```sql
SELECT * FROM rows
WHERE collection = 'X'
AND schema_name = 'Y'
AND index_name = '<from payload>'
AND index_value = <from payload>
```
7. Return matched rows
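A sketch of this flow, assuming the Python qdrant-client and cassandra-driver (the real processor routes the embedding call through the embeddings service and flow infrastructure):
```python
def fuzzy_row_lookup(qdrant, session, embed, user, collection, schema_name,
                     query_text, limit=10):
    vector = embed(query_text)  # embeddings service call, e.g. for "Chestnut Street"

    # One Qdrant collection per (user, collection, schema, dimension);
    # components sanitized as described above (omitted here for brevity).
    prefix = f"rows_{user}_{collection}_{schema_name}_"
    names = [c.name for c in qdrant.get_collections().collections
             if c.name.startswith(prefix)]

    rows = []
    for name in names:
        hits = qdrant.search(collection_name=name, query_vector=vector, limit=limit)
        for hit in hits:
            rows.extend(session.execute(
                "SELECT * FROM rows WHERE collection = %s AND schema_name = %s "
                "AND index_name = %s AND index_value = %s",
                (collection, schema_name,
                 hit.payload["index_name"], hit.payload["index_value"])))
    return rows
```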
#### Optional: Filtering by Index Name
Queries can optionally filter by `index_name` in Qdrant to search only specific fields:
- **"Find any field matching 'Chestnut'"** → search all vectors in the collection
- **"Find street_name matching 'Chestnut'"** → filter where `payload.index_name = 'street_name'`
#### Architecture
Row embeddings follow the **two-stage pattern** used by GraphRAG (graph-embeddings, document-embeddings):
- **Stage 1: Embedding computation** (`trustgraph-flow/trustgraph/embeddings/row_embeddings/`) - Consumes `ExtractedObject`, computes embeddings via the embeddings service, outputs `RowEmbeddings`
- **Stage 2: Embedding storage** (`trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/`) - Consumes `RowEmbeddings`, writes vectors to Qdrant
The Cassandra row writer is a separate parallel consumer:
- **Cassandra row writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra`) - Consumes `ExtractedObject`, writes rows to Cassandra
All three services consume from the same flow, keeping them decoupled. This allows:
- Independent scaling of Cassandra writes vs embedding generation vs vector storage
- Embedding services can be disabled if not needed
- Failures in one service don't affect the others
- Consistent architecture with GraphRAG pipelines
#### Write Path
**Stage 1 (row-embeddings processor):** When receiving an `ExtractedObject`:
1. Look up the schema to find indexed fields
2. For each indexed field:
- Build the text representation of the index value
- Compute embedding via the embeddings service
3. Output a `RowEmbeddings` message containing all computed vectors
**Stage 2 (row-embeddings-write-qdrant):** When receiving a `RowEmbeddings`:
1. For each embedding in the message:
- Determine Qdrant collection from `(user, collection, schema_name, dimension)`
- Create collection if needed (lazy creation on first write)
- Upsert point with vector and payload
#### Message Types
```python
from dataclasses import dataclass

@dataclass
class RowIndexEmbedding:
    index_name: str             # The indexed field name(s)
    index_value: list[str]      # The field value(s)
    text: str                   # Text that was embedded
    vectors: list[list[float]]  # Computed embedding vectors

@dataclass
class RowEmbeddings:
    metadata: Metadata          # existing TrustGraph metadata type
    schema_name: str
    embeddings: list[RowIndexEmbedding]
```
#### Deletion Integration
Qdrant collections are discovered by prefix matching on the collection name pattern:
**Delete `(user, collection)`:**
1. List all Qdrant collections matching prefix `rows_{user}_{collection}_`
2. Delete each matching collection
3. Delete Cassandra rows partitions (as documented above)
4. Clean up `row_partitions` entries
**Delete `(user, collection, schema_name)`:**
1. List all Qdrant collections matching prefix `rows_{user}_{collection}_{schema_name}_`
2. Delete each matching collection (handles multiple dimensions)
3. Delete Cassandra rows partitions
4. Clean up `row_partitions`
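The Qdrant side of both deletion cases (steps 1–2) reduces to prefix matching over the collection list, roughly:
```python
def delete_row_collections(qdrant, prefix):
    # prefix is "rows_{user}_{collection}_" or "rows_{user}_{collection}_{schema_name}_"
    for c in qdrant.get_collections().collections:
        if c.name.startswith(prefix):
            qdrant.delete_collection(collection_name=c.name)
```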
#### Module Locations
| Stage | Module | Entry Point |
|-------|--------|-------------|
| Stage 1 | `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | `row-embeddings` |
| Stage 2 | `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | `row-embeddings-write-qdrant` |
### Row Embeddings Query API
The row embeddings query is a **separate API** from the GraphQL row query service:
| API | Purpose | Backend |
|-----|---------|---------|
| Row Query (GraphQL) | Exact matching on indexed fields | Cassandra |
| Row Embeddings Query | Fuzzy/semantic matching | Qdrant |
This separation keeps concerns clean:
- GraphQL service focuses on exact, structured queries
- Embeddings API handles semantic similarity
- User workflow: fuzzy search via embeddings to find candidates, then exact query to get full row data
#### Request/Response Schema
```python
from dataclasses import dataclass, field

@dataclass
class RowEmbeddingsRequest:
    vectors: list[list[float]]   # Query vectors (pre-computed embeddings)
    user: str = ""
    collection: str = ""
    schema_name: str = ""
    index_name: str = ""         # Optional: filter to specific index
    limit: int = 10              # Max results per vector

@dataclass
class RowIndexMatch:
    index_name: str = ""                                   # The matched index field(s)
    index_value: list[str] = field(default_factory=list)   # The matched value(s)
    text: str = ""                                          # Original text that was embedded
    score: float = 0.0                                      # Similarity score

@dataclass
class RowEmbeddingsResponse:
    error: Error | None = None
    matches: list[RowIndexMatch] = field(default_factory=list)
```
#### Query Processor
Module: `trustgraph-flow/trustgraph/query/row_embeddings/qdrant`
Entry point: `row-embeddings-query-qdrant`
The processor:
1. Receives `RowEmbeddingsRequest` with query vectors
2. Finds the appropriate Qdrant collection by prefix matching
3. Searches for nearest vectors with optional `index_name` filter
4. Returns `RowEmbeddingsResponse` with matching index information
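A sketch of the mapping from Qdrant results to the response message (step 4); `hits` is a list of scored points as returned by a Qdrant search:
```python
def build_response(hits):
    # Each hit carries the payload written at storage time plus a similarity score
    matches = [RowIndexMatch(
        index_name=hit.payload["index_name"],
        index_value=hit.payload["index_value"],
        text=hit.payload["text"],
        score=hit.score,
    ) for hit in hits]
    return RowEmbeddingsResponse(error=None, matches=matches)
```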
#### API Gateway Integration
The gateway exposes row embeddings queries via the standard request/response pattern:
| Component | Location |
|-----------|----------|
| Dispatcher | `trustgraph-flow/trustgraph/gateway/dispatch/row_embeddings_query.py` |
| Registration | Add `"row-embeddings"` to `request_response_dispatchers` in `manager.py` |
Flow interface name: `row-embeddings`
Interface definition in flow blueprint:
```json
{
"interfaces": {
"row-embeddings": {
"request": "non-persistent://tg/request/row-embeddings:{id}",
"response": "non-persistent://tg/response/row-embeddings:{id}"
}
}
}
```
#### Python SDK Support
The SDK provides methods for row embeddings queries:
```python
# Flow-scoped query (preferred)
api = Api(url)
flow = api.flow().id("default")
# Query with text (SDK computes embeddings)
matches = flow.row_embeddings_query(
text="Chestnut Street",
collection="my_collection",
schema_name="addresses",
index_name="street_name", # Optional filter
limit=10
)
# Query with pre-computed vectors
matches = flow.row_embeddings_query(
vectors=[[0.1, 0.2, ...]],
collection="my_collection",
schema_name="addresses"
)
# Each match contains:
for match in matches:
print(match.index_name) # e.g., "street_name"
print(match.index_value) # e.g., ["CHESTNUT ST"]
print(match.text) # e.g., "CHESTNUT ST"
print(match.score) # e.g., 0.95
```
#### CLI Utility
Command: `tg-invoke-row-embeddings`
```bash
# Query by text (computes embedding automatically)
tg-invoke-row-embeddings \
--text "Chestnut Street" \
--collection my_collection \
--schema addresses \
--index street_name \
--limit 10
# Query by vector file
tg-invoke-row-embeddings \
--vectors vectors.json \
--collection my_collection \
--schema addresses
# Output formats
tg-invoke-row-embeddings --text "..." --format json
tg-invoke-row-embeddings --text "..." --format table
```
#### Typical Usage Pattern
The row embeddings query is typically used as part of a fuzzy-to-exact lookup flow:
```python
# Step 1: Fuzzy search via embeddings
matches = flow.row_embeddings_query(
text="chestnut street",
collection="geo",
schema_name="streets"
)
# Step 2: Exact lookup via GraphQL for full row data
for match in matches:
query = f'''
query {{
streets(where: {{ {match.index_name}: {{ eq: "{match.index_value[0]}" }} }}) {{
street_name
city
zip_code
}}
}}
'''
rows = flow.rows_query(query, collection="geo")
```
This two-step pattern enables:
- Finding "CHESTNUT ST" when user searches for "Chestnut Street"
- Retrieving complete row data with all fields
- Combining semantic similarity with structured data access
### Row Data Ingestion
Deferred to a subsequent phase. Will be designed alongside other ingestion changes.
## Implementation Impact
### Current State Analysis
The existing implementation has two main components:
| Component | Location | Lines | Description |
|-----------|----------|-------|-------------|
| Query Service | `trustgraph-flow/trustgraph/query/objects/cassandra/service.py` | ~740 | Monolithic: GraphQL schema generation, filter parsing, Cassandra queries, request handling |
| Writer | `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py` | ~540 | Per-schema table creation, secondary indexes, insert/delete |
**Current Query Pattern:**
```sql
SELECT * FROM {keyspace}.o_{schema_name}
WHERE collection = 'X' AND email = 'foo@bar.com'
ALLOW FILTERING
```
**New Query Pattern:**
```sql
SELECT * FROM {keyspace}.rows
WHERE collection = 'X' AND schema_name = 'customers'
AND index_name = 'email' AND index_value = ['foo@bar.com']
```
### Key Changes
1. **Query semantics simplify**: The new schema only supports exact matches on `index_value`. The current GraphQL filters (`gt`, `lt`, `contains`, etc.) either:
- Become post-filtering on returned data (if still needed)
- Are removed in favor of using the embeddings API for fuzzy matching
2. **GraphQL code is tightly coupled**: The current `service.py` bundles Strawberry type generation, filter parsing, and Cassandra-specific queries. Adding another row store backend would duplicate ~400 lines of GraphQL code.
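If post-filtering is retained, it amounts to applying the comparison operators in memory after the exact-match read; a minimal sketch (the operator table and helper are illustrative, not part of the current code):
```python
OPS = {
    "eq": lambda v, x: v == x,
    "gt": lambda v, x: v > x,
    "lt": lambda v, x: v < x,
    "contains": lambda v, x: x in v,
}

def post_filter(rows, field_name, op, value):
    # Applied to rows already returned by the exact-match query on the rows table
    return [r for r in rows if OPS[op](r[field_name], value)]
```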
### Proposed Refactor
The refactor has two parts:
#### 1. Break Out GraphQL Code
Extract reusable GraphQL components into a shared module:
```
trustgraph-flow/trustgraph/query/graphql/
├── __init__.py
├── types.py # Filter types (IntFilter, StringFilter, FloatFilter)
├── schema.py # Dynamic schema generation from RowSchema
└── filters.py # Filter parsing utilities
```
This enables:
- Reuse across different row store backends
- Cleaner separation of concerns
- Easier testing of GraphQL logic independently
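As a sketch of what `types.py` might hold (the current service already uses Strawberry; the exact operator set per filter type is illustrative):
```python
from typing import Optional
import strawberry

@strawberry.input
class StringFilter:
    eq: Optional[str] = None
    contains: Optional[str] = None

@strawberry.input
class IntFilter:
    eq: Optional[int] = None
    gt: Optional[int] = None
    lt: Optional[int] = None

@strawberry.input
class FloatFilter:
    eq: Optional[float] = None
    gt: Optional[float] = None
    lt: Optional[float] = None
```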
#### 2. Implement New Table Schema
Refactor the Cassandra-specific code to use the unified table:
**Writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra/`):
- Single `rows` table instead of per-schema tables
- Write N copies per row (one per index)
- Register to `row_partitions` table
- Simpler table creation (one-time setup)
**Query Service** (`trustgraph-flow/trustgraph/query/rows/cassandra/`):
- Query the unified `rows` table
- Use extracted GraphQL module for schema generation
- Simplified filter handling (exact match only at DB level)
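To make the writer side concrete, a sketch of writing one row copy per index with `cassandra-driver`; the payload column on `rows` and the exact layout of `row_partitions` are assumptions, since only the key columns appear in the query examples above:
```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect("trustgraph")

def write_row(collection, schema_name, indexes, row_json):
    # indexes: [("email", ["foo@bar.com"]), ...] - one entry per indexed field
    for index_name, index_value in indexes:
        # One copy of the row per index, so every indexed value is a queryable key
        session.execute(
            "INSERT INTO rows (collection, schema_name, index_name, index_value, row)"
            " VALUES (%s, %s, %s, %s, %s)",
            (collection, schema_name, index_name, index_value, row_json))
        # Register the partition so collection-level deletion can find it later
        session.execute(
            "INSERT INTO row_partitions (collection, schema_name, index_name, index_value)"
            " VALUES (%s, %s, %s, %s)",
            (collection, schema_name, index_name, index_value))
```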
### Module Renames
As part of the "object" → "row" naming cleanup:
| Current | New |
|---------|-----|
| `storage/objects/cassandra/` | `storage/rows/cassandra/` |
| `query/objects/cassandra/` | `query/rows/cassandra/` |
| `embeddings/object_embeddings/` | `embeddings/row_embeddings/` |
### New Modules
| Module | Purpose |
|--------|---------|
| `trustgraph-flow/trustgraph/query/graphql/` | Shared GraphQL utilities |
| `trustgraph-flow/trustgraph/query/row_embeddings/qdrant/` | Row embeddings query API |
| `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | Row embeddings computation (Stage 1) |
| `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | Row embeddings storage (Stage 2) |
## References
- [Structured Data Technical Specification](structured-data.md)