Mirror of https://github.com/trustgraph-ai/trustgraph.git, synced 2026-04-26 00:46:22 +02:00.

Commit b9d7bf9a8b (parent 3666ece2c5): Merge 2.0 to master (#651) — 212 changed files with 13940 additions and 6180 deletions.
New file: docs/tech-specs/entity-centric-graph.md (260 lines)
# Entity-Centric Knowledge Graph Storage on Cassandra

## Overview

This document describes a storage model for RDF-style knowledge graphs on Apache Cassandra. The model uses an **entity-centric** approach where every entity knows every quad it participates in and the role it plays. This replaces a traditional multi-table SPO permutation approach with just two tables.

## Background and Motivation

### The Traditional Approach

A standard RDF quad store on Cassandra requires multiple denormalised tables to cover query patterns — typically 6 or more tables representing different permutations of Subject, Predicate, Object, and Dataset (SPOD). Each quad is written to every table, resulting in significant write amplification, operational overhead, and schema complexity.

Additionally, label resolution (fetching human-readable names for entities) requires separate round-trip queries, which is particularly costly in AI and GraphRAG use cases where labels are essential for LLM context.

### The Entity-Centric Insight

Every quad `(D, S, P, O)` involves up to 4 entities. By writing a row for each entity's participation in the quad, we guarantee that **any query with at least one known element will hit a partition key**. This covers all 16 query patterns with a single data table.

Key benefits:

- **2 tables** instead of 7+
- **4 data writes per quad** (plus 1 manifest row) instead of 6+
- **Label resolution for free** — an entity's labels are co-located with its relationships, naturally warming the application cache
- **All 16 query patterns** served by single-partition reads
- **Simpler operations** — one data table to tune, compact, and repair

## Schema

### Table 1: quads_by_entity

The primary data table. Every entity has a partition containing all quads it participates in. Named to reflect the query pattern (lookup by entity).
```sql
CREATE TABLE quads_by_entity (
    collection text,   -- Collection/tenant scope (always specified)
    entity text,       -- The entity this row is about
    role text,         -- 'S', 'P', 'O', 'G' — how this entity participates
    p text,            -- Predicate of the quad
    otype text,        -- 'U' (URI), 'L' (literal), 'T' (triple/reification)
    s text,            -- Subject of the quad
    o text,            -- Object of the quad
    d text,            -- Dataset/graph of the quad
    dtype text,        -- XSD datatype (when otype = 'L'), e.g. 'xsd:string'
    lang text,         -- Language tag (when otype = 'L'), e.g. 'en', 'fr'
    PRIMARY KEY ((collection, entity), role, p, otype, s, o, d)
);
```
**Partition key**: `(collection, entity)` — scoped to collection, one partition per entity.

**Clustering column order rationale**:

1. **role** — most queries start with "where is this entity a subject/object?"
2. **p** — next most common filter: "give me all `knows` relationships"
3. **otype** — enables filtering by URI-valued vs literal-valued relationships
4. **s, o, d** — remaining columns for uniqueness

### Table 2: quads_by_collection

Supports collection-level queries and deletion. Provides a manifest of all quads belonging to a collection. Named to reflect the query pattern (lookup by collection).
```sql
CREATE TABLE quads_by_collection (
    collection text,
    d text,            -- Dataset/graph of the quad
    s text,            -- Subject of the quad
    p text,            -- Predicate of the quad
    o text,            -- Object of the quad
    otype text,        -- 'U' (URI), 'L' (literal), 'T' (triple/reification)
    dtype text,        -- XSD datatype (when otype = 'L')
    lang text,         -- Language tag (when otype = 'L')
    PRIMARY KEY (collection, d, s, p, o)
);
```

Clustered by dataset first, enabling deletion at either collection or dataset granularity.
## Write Path

For each incoming quad `(D, S, P, O)` within a collection `C`, write **4 rows** to `quads_by_entity` and **1 row** to `quads_by_collection`.

### Example

Given the quad in collection `tenant1`:

```
Dataset:   https://example.org/graph1
Subject:   https://example.org/Alice
Predicate: https://example.org/knows
Object:    https://example.org/Bob
```
Write 4 rows to `quads_by_entity`:

| collection | entity | role | p | otype | s | o | d |
|---|---|---|---|---|---|---|---|
| tenant1 | https://example.org/graph1 | G | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
| tenant1 | https://example.org/Alice | S | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
| tenant1 | https://example.org/knows | P | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |
| tenant1 | https://example.org/Bob | O | https://example.org/knows | U | https://example.org/Alice | https://example.org/Bob | https://example.org/graph1 |

Write 1 row to `quads_by_collection`:

| collection | d | s | p | o | otype | dtype | lang |
|---|---|---|---|---|---|---|---|
| tenant1 | https://example.org/graph1 | https://example.org/Alice | https://example.org/knows | https://example.org/Bob | U | | |
### Literal Example

For a label triple:

```
Dataset:   https://example.org/graph1
Subject:   https://example.org/Alice
Predicate: http://www.w3.org/2000/01/rdf-schema#label
Object:    "Alice Smith" (lang: en)
```

The `otype` is `'L'`, `dtype` is `'xsd:string'`, and `lang` is `'en'`. The literal value `"Alice Smith"` is stored in `o`. Only 3 rows are needed in `quads_by_entity` — no row is written for the literal as entity, since literals are not independently queryable entities.
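The write path for both cases can be sketched in Python. This is illustrative only: `rows_for_quad` and the dict-based row layout are assumptions for the sketch, not TrustGraph code. Literal objects produce only three entity rows, as described above.

```python
# Sketch of the write path: given one quad, produce the rows to insert.
# Illustrative only — rows_for_quad is not a real TrustGraph function.

def rows_for_quad(collection, d, s, p, o, otype="U", dtype="", lang=""):
    """Return (entity_rows, manifest_row) for a single quad."""
    quad = {"s": s, "o": o, "d": d, "otype": otype}
    # Role assignments: every participating entity gets a row.
    participants = [(d, "G"), (s, "S"), (p, "P")]
    if otype == "U":  # literals are not independently queryable entities
        participants.append((o, "O"))
    entity_rows = [
        {"collection": collection, "entity": e, "role": role, "p": p, **quad}
        for e, role in participants
    ]
    manifest_row = {"collection": collection, "d": d, "s": s, "p": p,
                    "o": o, "otype": otype, "dtype": dtype, "lang": lang}
    return entity_rows, manifest_row

# URI-valued quad: 4 entity rows + 1 manifest row
rows, manifest = rows_for_quad(
    "tenant1", "https://example.org/graph1",
    "https://example.org/Alice", "https://example.org/knows",
    "https://example.org/Bob")
assert len(rows) == 4

# Literal-valued quad: only 3 entity rows
rows, _ = rows_for_quad(
    "tenant1", "https://example.org/graph1",
    "https://example.org/Alice",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "Alice Smith", otype="L", dtype="xsd:string", lang="en")
assert len(rows) == 3
```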
## Query Patterns

### All 16 DSPO Patterns

In the table below, "Perfect prefix" means the query uses a contiguous prefix of the clustering columns. "Partition scan + filter" means Cassandra reads a slice of one partition and filters in memory — still efficient, just not a pure prefix match.
| # | Known | Lookup entity | Clustering prefix | Efficiency |
|---|---|---|---|---|
| 1 | D,S,P,O | entity=S, role='S', p=P | Full match | Perfect prefix |
| 2 | D,S,P,? | entity=S, role='S', p=P | Filter on D | Partition scan + filter |
| 3 | D,S,?,O | entity=S, role='S' | Filter on D, O | Partition scan + filter |
| 4 | D,?,P,O | entity=O, role='O', p=P | Filter on D | Partition scan + filter |
| 5 | ?,S,P,O | entity=S, role='S', p=P | Filter on O | Partition scan + filter |
| 6 | D,S,?,? | entity=S, role='S' | Filter on D | Partition scan + filter |
| 7 | D,?,P,? | entity=P, role='P' | Filter on D | Partition scan + filter |
| 8 | D,?,?,O | entity=O, role='O' | Filter on D | Partition scan + filter |
| 9 | ?,S,P,? | entity=S, role='S', p=P | — | **Perfect prefix** |
| 10 | ?,S,?,O | entity=S, role='S' | Filter on O | Partition scan + filter |
| 11 | ?,?,P,O | entity=O, role='O', p=P | — | **Perfect prefix** |
| 12 | D,?,?,? | entity=D, role='G' | — | **Perfect prefix** |
| 13 | ?,S,?,? | entity=S, role='S' | — | **Perfect prefix** |
| 14 | ?,?,P,? | entity=P, role='P' | — | **Perfect prefix** |
| 15 | ?,?,?,O | entity=O, role='O' | — | **Perfect prefix** |
| 16 | ?,?,?,? | — | Full scan | Exploration only |

**Key result**: 7 of the 15 non-trivial patterns are perfect clustering-prefix hits. The remaining 8 are single-partition reads with in-partition filtering. Every query with at least one known element hits a partition key.

Pattern 16 (?,?,?,?) does not occur in practice since the collection is always specified, reducing it to pattern 12.
### Common Query Examples

**Everything about an entity:**

```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice';
```

**All outgoing relationships for an entity:**

```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
  AND role = 'S';
```

**Specific predicate for an entity:**

```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
  AND role = 'S' AND p = 'https://example.org/knows';
```

**Label for an entity (specific language):**

```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
  AND role = 'S' AND p = 'http://www.w3.org/2000/01/rdf-schema#label'
  AND otype = 'L';
```

Then filter by `lang = 'en'` application-side if needed.

**Only URI-valued relationships (entity-to-entity links):**

```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Alice'
  AND role = 'S' AND p = 'https://example.org/knows' AND otype = 'U';
```

**Reverse lookup — what points to this entity:**

```sql
SELECT * FROM quads_by_entity
WHERE collection = 'tenant1' AND entity = 'https://example.org/Bob'
  AND role = 'O';
```
## Label Resolution and Cache Warming

One of the most significant advantages of the entity-centric model is that **label resolution becomes a free side effect**.

In the traditional multi-table model, fetching labels requires separate round-trip queries: retrieve triples, identify entity URIs in the results, then fetch `rdfs:label` for each. This N+1 pattern is expensive.

In the entity-centric model, querying an entity returns **all** its quads — including its labels, types, and other properties. When the application caches query results, labels are pre-warmed before anything asks for them.

Two usage regimes confirm this works well in practice:

- **Human-facing queries**: naturally small result sets, labels essential. Entity reads pre-warm the cache.
- **AI/bulk queries**: large result sets with hard limits. Labels either unnecessary or needed only for a curated subset of entities already in cache.

The theoretical concern of resolving labels for huge result sets (e.g. 30,000 entities) is mitigated by the practical observation that no human or AI consumer usefully processes that many labels. Application-level query limits ensure cache pressure remains manageable.
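A minimal sketch of the warming effect, assuming a hypothetical `EntityCache` wrapper and a plain dict standing in for the store:

```python
# Sketch of cache warming: one entity read brings labels along for free.
# EntityCache and the stubbed store are illustrative, not TrustGraph code.

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

class EntityCache:
    def __init__(self, store):
        self.store = store   # maps (collection, entity) -> list of quad rows
        self.cache = {}

    def entity_quads(self, collection, entity):
        key = (collection, entity)
        if key not in self.cache:
            self.cache[key] = self.store.get(key, [])
        return self.cache[key]

    def label(self, collection, entity, lang="en"):
        # No extra round trip: labels ride along with the entity's quads.
        for row in self.entity_quads(collection, entity):
            if row["role"] == "S" and row["p"] == RDFS_LABEL \
                    and row.get("lang") == lang:
                return row["o"]
        return None

store = {
    ("tenant1", "ex:Alice"): [
        {"role": "S", "p": "ex:knows", "o": "ex:Bob", "otype": "U"},
        {"role": "S", "p": RDFS_LABEL, "o": "Alice Smith",
         "otype": "L", "lang": "en"},
    ],
}
cache = EntityCache(store)
cache.entity_quads("tenant1", "ex:Alice")   # one read warms the cache
assert cache.label("tenant1", "ex:Alice") == "Alice Smith"
```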
## Wide Partitions and Reification

Reification (RDF-star style statements about statements) creates hub entities — e.g. a source document that supports thousands of extracted facts. This can produce wide partitions.

Mitigating factors:

- **Application-level query limits**: all GraphRAG and human-facing queries enforce hard limits, so wide partitions are never fully scanned on the hot read path
- **Cassandra handles partial reads efficiently**: a clustering-column scan with an early stop is fast even on large partitions
- **Collection deletion** (the only operation that might traverse full partitions) is an accepted background process
## Collection Deletion

Triggered by API call; runs in the background (eventually consistent).

1. Read `quads_by_collection` for the target collection to get all quads
2. Extract unique entities from the quads (s, p, o, d values)
3. For each unique entity, delete the partition from `quads_by_entity`
4. Delete the rows from `quads_by_collection`

The `quads_by_collection` table provides the index needed to locate all entity partitions without a full table scan. Partition-level deletes are efficient since `(collection, entity)` is the partition key.
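The four steps can be sketched as follows, with in-memory stand-ins for the two tables. `delete_collection` is an illustration of the flow, not TrustGraph code:

```python
# Sketch of the background collection-deletion steps above.
# delete_collection and the store stubs are illustrative only.

def delete_collection(manifest, entity_store, collection):
    """manifest: list of quad dicts from quads_by_collection.
    entity_store: dict keyed by (collection, entity) partitions."""
    # Steps 1-2: read the manifest and extract the unique entities.
    entities = set()
    for quad in manifest:
        entities.update([quad["s"], quad["p"], quad["o"], quad["d"]])
    # Step 3: delete each entity partition from quads_by_entity.
    for entity in entities:
        entity_store.pop((collection, entity), None)
    # Step 4: drop the manifest rows themselves.
    manifest.clear()
    return entities

manifest = [{"d": "ex:g1", "s": "ex:Alice", "p": "ex:knows", "o": "ex:Bob"}]
entity_store = {("tenant1", e): [] for e in
                ["ex:g1", "ex:Alice", "ex:knows", "ex:Bob"]}
removed = delete_collection(manifest, entity_store, "tenant1")
assert entity_store == {} and manifest == []
```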
## Migration Path from Multi-Table Model

The entity-centric model can coexist with the existing multi-table model during migration:

1. Deploy `quads_by_entity` and `quads_by_collection` tables alongside existing tables
2. Dual-write new quads to both old and new tables
3. Backfill existing data into the new tables
4. Migrate read paths one query pattern at a time
5. Decommission old tables once all reads are migrated
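Step 2 (dual-writing) amounts to a thin wrapper over both stores. `DualWriter` is a hypothetical illustration, not part of the codebase:

```python
# Sketch of migration step 2: every quad goes to both the old multi-table
# store and the new entity-centric store until reads are fully migrated.
# DualWriter is illustrative only.

class DualWriter:
    def __init__(self, old_store, new_store):
        self.old_store = old_store
        self.new_store = new_store

    def write_quad(self, collection, d, s, p, o):
        quad = (collection, d, s, p, o)
        self.old_store.append(quad)   # existing multi-table path
        self.new_store.append(quad)   # new two-table path

old, new = [], []
writer = DualWriter(old, new)
writer.write_quad("tenant1", "ex:g1", "ex:Alice", "ex:knows", "ex:Bob")
assert old == new and len(new) == 1
```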
## Summary

| Aspect | Traditional (multi-table) | Entity-centric (2-table) |
|---|---|---|
| Tables | 7+ | 2 |
| Writes per quad | 6+ | 5 (4 data + 1 manifest) |
| Label resolution | Separate round trips | Free via cache warming |
| Query patterns | 16 across 7+ tables | 16 on 1 table |
| Schema complexity | High | Low |
| Operational overhead | 7+ tables to tune/repair | 1 data table |
| Reification support | Additional complexity | Natural fit |
| Object type filtering | Not available | Native (via otype clustering) |
New file: docs/tech-specs/graph-contexts.md (573 lines)
# Graph Contexts Technical Specification

## Overview

This specification describes changes to TrustGraph's core graph primitives to align with RDF 1.2 and support full RDF Dataset semantics. This is a breaking change for the 2.x release series.

### Versioning

- **2.0**: Early adopter release. Core features available, may not be fully production-ready.
- **2.1 / 2.2**: Production release. Stability and completeness validated.

Flexibility on maturity is intentional - early adopters can access new capabilities before all features are production-hardened.
## Goals

The primary goals for this work are to enable metadata about facts/statements:

- **Temporal information**: Associate facts with time metadata
  - When a fact was believed to be true
  - When a fact became true
  - When a fact was discovered to be false

- **Provenance/Sources**: Track which sources support a fact
  - "This fact was supported by source X"
  - Link facts back to their origin documents

- **Veracity/Trust**: Record assertions about truth
  - "Person P asserted this was true"
  - "Person Q claims this is false"
  - Enable trust scoring and conflict detection

**Hypothesis**: Reification (RDF-star / quoted triples) is the key mechanism to achieve these outcomes, as all require making statements about statements.
## Background

To express "the fact (Alice knows Bob) was discovered on 2024-01-15" or "source X supports the claim (Y causes Z)", you need to reference an edge as a thing you can make statements about. Standard triples don't support this.

### Current Limitations

The current `Value` class in `trustgraph-base/trustgraph/schema/core/primitives.py` can represent:

- URI nodes (`is_uri=True`)
- Literal values (`is_uri=False`)

The `type` field exists but is not used to represent XSD datatypes.
## Technical Design

### RDF Features to Support

#### Core Features (Related to Reification Goals)

These features are directly related to the temporal, provenance, and veracity goals:

1. **RDF 1.2 Quoted Triples (RDF-star)**
   - Edges that point at other edges
   - A Triple can appear as the subject or object of another Triple
   - Enables statements about statements (reification)
   - Core mechanism for annotating individual facts

2. **RDF Dataset / Named Graphs**
   - Support for multiple named graphs within a dataset
   - Each graph identified by an IRI
   - Moves from triples (s, p, o) to quads (s, p, o, g)
   - Includes a default graph plus zero or more named graphs
   - The graph IRI can be a subject in statements, e.g.:
     ```
     <graph-source-A> <discoveredOn> "2024-01-15"
     <graph-source-A> <hasVeracity> "high"
     ```
   - Note: Named graphs are a separate feature from reification. They have uses beyond statement annotation (partitioning, access control, dataset organization) and should be treated as a distinct capability.

3. **Blank Nodes** (Limited Support)
   - Anonymous nodes without a global URI
   - Supported for compatibility when loading external RDF data
   - **Limited status**: No guarantees about stable identity after loading
   - Find them via wildcard queries (match by connections, not by ID)
   - Not a first-class feature - don't rely on precise blank node handling
#### Opportunistic Fixes (2.0 Breaking Change)

These features are not directly related to the reification goals but are valuable improvements to include while making breaking changes:

4. **Literal Datatypes**
   - Properly use the `type` field for XSD datatypes
   - Examples: xsd:string, xsd:integer, xsd:dateTime, etc.
   - Fixes a current limitation: dates and integers cannot be represented properly

5. **Language Tags**
   - Support for language attributes on string literals (@en, @fr, etc.)
   - Note: A literal has either a language tag OR a datatype, not both (except for rdf:langString)
   - Important for AI/multilingual use cases
### Data Models

#### Term (rename from Value)

The `Value` class will be renamed to `Term` to better reflect RDF terminology. This rename serves two purposes:

1. Aligns naming with RDF concepts (a "Term" can be an IRI, literal, blank node, or quoted triple - not just a "value")
2. Forces code review at the breaking change interface - any code still referencing `Value` is visibly broken and needs updating

A Term can represent:

- **IRI/URI** - A named node/resource
- **Blank Node** - An anonymous node with local scope
- **Literal** - A data value with either:
  - A datatype (XSD type), OR
  - A language tag
- **Quoted Triple** - A triple used as a term (RDF 1.2)
##### Chosen Approach: Single Class with Type Discriminator

Serialization requirements drive the structure - a type discriminator is needed in the wire format regardless of the Python representation. A single class with a type field is the natural fit and aligns with the current `Value` pattern.

Single-character type codes provide compact serialization:

```python
from dataclasses import dataclass

# Term type constants
IRI = "i"      # IRI/URI node
BLANK = "b"    # Blank node
LITERAL = "l"  # Literal value
TRIPLE = "t"   # Quoted triple (RDF-star)

@dataclass
class Term:
    type: str = ""  # One of: IRI, BLANK, LITERAL, TRIPLE

    # For IRI terms (type == IRI)
    iri: str = ""

    # For blank nodes (type == BLANK)
    id: str = ""

    # For literals (type == LITERAL)
    value: str = ""
    datatype: str = ""  # XSD datatype URI (mutually exclusive with language)
    language: str = ""  # Language tag (mutually exclusive with datatype)

    # For quoted triples (type == TRIPLE)
    triple: "Triple | None" = None
```
Usage examples:

```python
# IRI term
node = Term(type=IRI, iri="http://example.org/Alice")

# Literal with datatype
age = Term(type=LITERAL, value="42", datatype="xsd:integer")

# Literal with language tag
label = Term(type=LITERAL, value="Hello", language="en")

# Blank node
anon = Term(type=BLANK, id="_:b1")

# Quoted triple (statement about a statement)
inner = Triple(
    s=Term(type=IRI, iri="http://example.org/Alice"),
    p=Term(type=IRI, iri="http://example.org/knows"),
    o=Term(type=IRI, iri="http://example.org/Bob"),
)
reified = Term(type=TRIPLE, triple=inner)
```
##### Alternatives Considered

**Option B: Union of specialized classes** (`Term = IRI | BlankNode | Literal | QuotedTriple`)
- Rejected: Serialization would still need a type discriminator, adding complexity

**Option C: Base class with subclasses**
- Rejected: Same serialization issue, plus dataclass inheritance quirks
#### Triple / Quad

The `Triple` class gains an optional graph field to become a quad:

```python
@dataclass
class Triple:
    s: Term | None = None  # Subject
    p: Term | None = None  # Predicate
    o: Term | None = None  # Object
    g: str | None = None   # Graph name (IRI), None = default graph
```

Design decisions:
- **Field name**: `g` for consistency with `s`, `p`, `o`
- **Optional**: `None` means the default graph (unnamed)
- **Type**: Plain string (IRI) rather than Term
  - Graph names are always IRIs
  - Blank nodes as graph names ruled out (too confusing)
  - No need for the full Term machinery

Note: The class name stays `Triple` even though it's technically a quad now. This avoids churn, and "triple" is still the common terminology for the s/p/o portion. The graph context is metadata about where the triple lives.
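To make the wire-format implication concrete, here is a sketch of serializing a `Term` (including a nested quoted triple) to a dict carrying the single-character discriminator. The `term_to_dict` helper and the exact dict layout are assumptions for illustration, not part of the spec:

```python
# Sketch: a Term serializes to a dict keyed by its one-character type code.
# term_to_dict is illustrative only; minimal Term/Triple stand-ins mirror
# the definitions above.

from dataclasses import dataclass
from typing import Optional

IRI, BLANK, LITERAL, TRIPLE = "i", "b", "l", "t"

@dataclass
class Term:
    type: str = ""
    iri: str = ""
    id: str = ""
    value: str = ""
    datatype: str = ""
    language: str = ""
    triple: Optional["Triple"] = None

@dataclass
class Triple:
    s: Optional[Term] = None
    p: Optional[Term] = None
    o: Optional[Term] = None
    g: Optional[str] = None

def term_to_dict(t: Term) -> dict:
    d = {"type": t.type}
    if t.type == IRI:
        d["iri"] = t.iri
    elif t.type == BLANK:
        d["id"] = t.id
    elif t.type == LITERAL:
        d.update(value=t.value, datatype=t.datatype, language=t.language)
    elif t.type == TRIPLE:
        # Recurse into the quoted triple's terms.
        d["triple"] = {k: term_to_dict(v)
                       for k, v in (("s", t.triple.s), ("p", t.triple.p),
                                    ("o", t.triple.o)) if v is not None}
        d["triple"]["g"] = t.triple.g
    return d

inner = Triple(s=Term(type=IRI, iri="http://example.org/Alice"),
               p=Term(type=IRI, iri="http://example.org/knows"),
               o=Term(type=IRI, iri="http://example.org/Bob"))
wire = term_to_dict(Term(type=TRIPLE, triple=inner))
assert wire["type"] == "t"
assert wire["triple"]["s"]["iri"] == "http://example.org/Alice"
```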
### Candidate Query Patterns

The current query engine accepts combinations of S, P, O terms. With quoted triples, a triple itself becomes a valid term in those positions. Below are candidate query patterns that support the original goals.

#### Graph Parameter Semantics

Following SPARQL conventions for backward compatibility:

- **`g` omitted / None**: Query the default graph only
- **`g` = specific IRI**: Query that named graph only
- **`g` = wildcard / `*`**: Query across all graphs (equivalent to SPARQL `GRAPH ?g { ... }`)

This keeps simple queries simple and makes named graph queries opt-in.

Cross-graph queries (g=wildcard) are fully supported. The Cassandra schema includes dedicated tables (SPOG, POSG, OSPG) where g is a clustering column rather than a partition key, enabling efficient queries across all graphs.
#### Temporal Queries

**Find all facts discovered after a given date:**
```
S: ?                           # any quoted triple
P: <discoveredOn>
O: > "2024-01-15"^^xsd:date    # date comparison
```

**Find when a specific fact was believed true:**
```
S: << <Alice> <knows> <Bob> >>  # quoted triple as subject
P: <believedTrueFrom>
O: ?                            # returns the date
```

**Find facts that became false:**
```
S: ?                      # any quoted triple
P: <discoveredFalseOn>
O: ?                      # has any value (exists)
```
#### Provenance Queries

**Find all facts supported by a specific source:**
```
S: ?                # any quoted triple
P: <supportedBy>
O: <source:document-123>
```

**Find which sources support a specific fact:**
```
S: << <DrugA> <treats> <DiseaseB> >>  # quoted triple as subject
P: <supportedBy>
O: ?                                  # returns source IRIs
```
#### Veracity Queries

**Find assertions a person marked as true:**
```
S: ?                  # any quoted triple
P: <assertedTrueBy>
O: <person:Alice>
```

**Find conflicting assertions (same fact, different veracity):**
```
# First query: facts asserted true
S: ?
P: <assertedTrueBy>
O: ?

# Second query: facts asserted false
S: ?
P: <assertedFalseBy>
O: ?

# Application logic: find intersection of subjects
```

**Find facts with trust score below threshold:**
```
S: ?              # any quoted triple
P: <trustScore>
O: < 0.5          # numeric comparison
```
### Architecture

Significant changes required across multiple components:

#### This Repository (trustgraph)

- **Schema primitives** (`trustgraph-base/trustgraph/schema/core/primitives.py`)
  - Value → Term rename
  - New Term structure with type discriminator
  - Triple gains `g` field for graph context

- **Message translators** (`trustgraph-base/trustgraph/messaging/translators/`)
  - Update for new Term/Triple structures
  - Serialization/deserialization for new fields

- **Gateway components**
  - Handle new Term and quad structures

- **Knowledge cores**
  - Core changes to support quads and reification

- **Knowledge manager**
  - Schema changes propagate here

- **Storage layers**
  - Cassandra: Schema redesign (see Implementation Details)
  - Other backends: Deferred to later phases

- **Command-line utilities**
  - Update for new data structures

- **REST API documentation**
  - OpenAPI spec updates

#### External Repositories

- **Python API** (this repo)
  - Client library updates for new structures

- **TypeScript APIs** (separate repo)
  - Client library updates

- **Workbench** (separate repo)
  - Significant state management changes
### APIs

#### REST API

- Documented in OpenAPI spec
- Will need updates for new Term/Triple structures
- New endpoints may be needed for graph context operations

#### Python API (this repo)

- Client library changes to match new primitives
- Breaking changes to Term (was Value) and Triple

#### TypeScript API (separate repo)

- Parallel changes to Python API
- Separate release coordination

#### Workbench (separate repo)

- Significant state management changes
- UI updates for graph context features
### Implementation Details

#### Phased Storage Implementation

Multiple graph store backends exist (Cassandra, Neo4j, etc.). Implementation will proceed in phases.

**Phase 1: Cassandra**

- Start with the home-grown Cassandra store
- Full control over the storage layer enables rapid iteration
- Schema will be redesigned from scratch for quads + reification
- Validate the data model and query patterns against real use cases
#### Cassandra Schema Design

Cassandra requires multiple tables to support different query access patterns (each table efficiently queries by its partition key + clustering columns).

##### Query Patterns

With quads (g, s, p, o), each position can be specified or wildcard, giving 16 possible query patterns:
| # | g | s | p | o | Description |
|---|---|---|---|---|-------------|
| 1 | ? | ? | ? | ? | All quads |
| 2 | ? | ? | ? | o | By object |
| 3 | ? | ? | p | ? | By predicate |
| 4 | ? | ? | p | o | By predicate + object |
| 5 | ? | s | ? | ? | By subject |
| 6 | ? | s | ? | o | By subject + object |
| 7 | ? | s | p | ? | By subject + predicate |
| 8 | ? | s | p | o | Full triple (which graphs?) |
| 9 | g | ? | ? | ? | By graph |
| 10 | g | ? | ? | o | By graph + object |
| 11 | g | ? | p | ? | By graph + predicate |
| 12 | g | ? | p | o | By graph + predicate + object |
| 13 | g | s | ? | ? | By graph + subject |
| 14 | g | s | ? | o | By graph + subject + object |
| 15 | g | s | p | ? | By graph + subject + predicate |
| 16 | g | s | p | o | Exact quad |
##### Table Design

Cassandra constraint: you can only efficiently query by partition key, then filter on clustering columns left-to-right. For g-wildcard queries, g must be a clustering column. For g-specified queries, g in the partition key is more efficient.

**Two table families needed:**

**Family A: g-wildcard queries** (g in clustering columns)

| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| SPOG | (user, collection, s) | p, o, g | 5, 7, 8 |
| POSG | (user, collection, p) | o, s, g | 3, 4 |
| OSPG | (user, collection, o) | s, p, g | 2, 6 |

**Family B: g-specified queries** (g in partition key)

| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| GSPO | (user, collection, g, s) | p, o | 9, 13, 15, 16 |
| GPOS | (user, collection, g, p) | o, s | 11, 12 |
| GOSP | (user, collection, g, o) | s, p | 10, 14 |

**Collection table** (for iteration and bulk deletion)

| Table | Partition | Clustering | Purpose |
|-------|-----------|------------|---------|
| COLL | (user, collection) | g, s, p, o | Enumerate all quads in collection |
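Table selection for a given pattern can be sketched as a router over the two families. `select_table` is a hypothetical helper; its mapping follows the "Supports patterns" columns above:

```python
# Sketch: choose the table for a (g, s, p, o) pattern, following the
# table-family assignments above. Illustrative, not TrustGraph code.

def select_table(g=False, s=False, p=False, o=False):
    """Booleans: is each position specified (True) or wildcard (False)?"""
    if g:                         # Family B: g in the partition key
        if s and not p and o:
            return "GOSP"         # pattern 14: g + s + o
        if s:
            return "GSPO"         # patterns 13, 15, 16
        if p:
            return "GPOS"         # patterns 11, 12
        if o:
            return "GOSP"         # pattern 10
        return "GSPO"             # pattern 9, per the table above
    # Family A: g is a clustering column
    if s and not p and o:
        return "OSPG"             # pattern 6: s + o
    if s:
        return "SPOG"             # patterns 5, 7, 8
    if p:
        return "POSG"             # patterns 3, 4
    if o:
        return "OSPG"             # pattern 2
    return "COLL"                 # pattern 1: enumerate the collection

assert select_table(g=True, s=True, p=True, o=True) == "GSPO"  # pattern 16
assert select_table(p=True, o=True) == "POSG"                  # pattern 4
assert select_table(s=True, o=True) == "OSPG"                  # pattern 6
assert select_table() == "COLL"                                # pattern 1
```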
##### Write and Delete Paths

**Write path**: Insert into all 7 tables.

**Delete collection path**:
1. Iterate the COLL table for `(user, collection)`
2. For each quad, delete from all 6 query tables
3. Delete from the COLL table (or range delete)

**Delete single quad path**: Delete from all 7 tables directly.

##### Storage Cost

Each quad is stored 7 times. This is the cost of flexible querying combined with efficient collection deletion.
##### Quoted Triples in Storage
|
||||
|
||||
Subject or object can be a triple itself. Options:
|
||||
|
||||
**Option A: Serialize quoted triples to canonical string**
|
||||
```
|
||||
S: "<<http://ex/Alice|http://ex/knows|http://ex/Bob>>"
|
||||
P: http://ex/discoveredOn
|
||||
O: "2024-01-15"
|
||||
G: null
|
||||
```
|
||||
- Store quoted triple as serialized string in S or O columns
|
||||
- Query by exact match on serialized form
|
||||
- Pro: Simple, fits existing index patterns
|
||||
- Con: Can't query "find triples where quoted subject's predicate is X"
|
||||
|
||||
**Option B: Triple IDs / Hashes**

```
Triple table:
  id: hash(s,p,o,g)
  s, p, o, g: ...

Metadata table:
  subject_triple_id: <hash>
  p: http://ex/discoveredOn
  o: "2024-01-15"
```

- Assign each triple an ID (hash of components)
- Reification metadata references triples by ID
- Pro: Clean separation, can index triple IDs
- Con: Requires computing/managing triple identity, two-phase lookups

**Recommendation**: Start with Option A (serialized strings) for simplicity. Option B may be needed if advanced query patterns over quoted triple components are required.
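Option B's triple identity can be sketched as a content hash. The hash choice and length-prefixing scheme here are assumptions for illustration, not a committed design:

```python
import hashlib

def triple_id(s, p, o, g=None):
    """Deterministic triple ID: SHA-256 over length-prefixed components."""
    h = hashlib.sha256()
    for part in (s, p, o, g or ""):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))  # prefix guards against ambiguity
        h.update(data)
    return h.hexdigest()
```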
2. **Phase 2+: Other Backends**
   - Neo4j and other stores implemented in subsequent stages
   - Lessons learned from Cassandra inform these implementations

This approach de-risks the design by validating on a fully-controlled backend before committing to implementations across all stores.

#### Value → Term Rename

The `Value` class will be renamed to `Term`. This affects ~78 files across the codebase. The rename acts as a forcing function: any code still using `Value` is immediately identifiable as needing review/update for 2.0 compatibility.
## Security Considerations

Named graphs are not a security feature. Users and collections remain the security boundaries. Named graphs are purely for data organization and reification support.

## Performance Considerations

- Quoted triples add nesting depth, which may impact query performance
- Named graph indexing strategies are needed for efficient graph-scoped queries
- The Cassandra schema design will need to accommodate quad storage efficiently

### Vector Store Boundary

Vector stores always reference IRIs only:

- Never edges (quoted triples)
- Never literal values
- Never blank nodes

This keeps the vector store simple: it handles semantic similarity of named entities. The graph structure handles relationships, reification, and metadata. Quoted triples and named graphs don't complicate vector operations.

## Testing Strategy

Use the existing test strategy. As this is a breaking change, focus extensively on the end-to-end test suite to validate that the new structures work correctly across all components.

## Migration Plan

- 2.0 is a breaking release; no backward compatibility required
- Existing data may need migration to the new schema (TBD based on final design)
- Consider migration tooling for converting existing triples

## Open Questions

- **Blank nodes**: Limited support confirmed. May need to decide on a skolemization strategy (generate IRIs on load, or preserve blank node IDs).
- **Query syntax**: What is the concrete syntax for specifying quoted triples in queries? Need to define the query API.
- ~~**Predicate vocabulary**~~: Resolved. Any valid RDF predicates are permitted, including custom user-defined ones. Minimal assumptions about RDF validity, and very few locked-in values (e.g., `rdfs:label` used in some places). Strategy: avoid locking anything in unless absolutely necessary.
- ~~**Vector store impact**~~: Resolved. Vector stores always point to IRIs only, never edges, literals, or blank nodes. Quoted triples and reification don't affect the vector store.
- ~~**Named graph semantics**~~: Resolved. Queries default to the default graph (matching SPARQL behavior, backward compatible). An explicit graph parameter is required to query named graphs or all graphs.

## References

- [RDF 1.2 Concepts](https://www.w3.org/TR/rdf12-concepts/)
- [RDF-star and SPARQL-star](https://w3c.github.io/rdf-star/)
- [RDF Dataset](https://www.w3.org/TR/rdf11-concepts/#section-dataset)
455 docs/tech-specs/jsonl-prompt-output.md Normal file
# JSONL Prompt Output Technical Specification

## Overview

This specification describes the implementation of JSONL (JSON Lines) output format for prompt responses in TrustGraph. JSONL enables truncation-resilient extraction of structured data from LLM responses, addressing critical issues with JSON array outputs being corrupted when LLM responses hit output token limits.

This implementation supports the following use cases:

1. **Truncation-Resilient Extraction**: Extract valid partial results even when LLM output is truncated mid-response
2. **Large-Scale Extraction**: Handle extraction of many items without risk of complete failure due to token limits
3. **Mixed-Type Extraction**: Support extraction of multiple entity types (definitions, relationships, entities, attributes) in a single prompt
4. **Streaming-Compatible Output**: Enable future streaming/incremental processing of extraction results

## Goals

- **Backward Compatibility**: Existing prompts using `response-type: "text"` and `response-type: "json"` continue to work without modification
- **Truncation Resilience**: Partial LLM outputs yield partial valid results rather than complete failure
- **Schema Validation**: Support JSON Schema validation for individual objects
- **Discriminated Unions**: Support mixed-type outputs using a `type` field discriminator
- **Minimal API Changes**: Extend existing prompt configuration with a new response type and schema key
## Background

### Current Architecture

The prompt service supports two response types:

1. `response-type: "text"` - Raw text response returned as-is
2. `response-type: "json"` - JSON parsed from the response, validated against an optional `schema`

Current implementation in `trustgraph-flow/trustgraph/template/prompt_manager.py`:

```python
class Prompt:
    def __init__(self, template, response_type="text", terms=None, schema=None):
        self.template = template
        self.response_type = response_type
        self.terms = terms
        self.schema = schema
```

### Current Limitations

When extraction prompts request output as JSON arrays (`[{...}, {...}, ...]`):

- **Truncation corruption**: If the LLM hits output token limits mid-array, the entire response becomes invalid JSON and cannot be parsed
- **All-or-nothing parsing**: Must receive complete output before parsing
- **No partial results**: A truncated response yields zero usable data
- **Unreliable for large extractions**: More extracted items = higher failure risk

This specification addresses these limitations by introducing JSONL format for extraction prompts, where each extracted item is a complete JSON object on its own line.
## Technical Design

### Response Type Extension

Add a new response type `"jsonl"` alongside the existing `"text"` and `"json"` types.

#### Configuration Changes

**New response type value:**

```
"response-type": "jsonl"
```

**Schema interpretation:**

The existing `"schema"` key is used for both `"json"` and `"jsonl"` response types. The interpretation depends on the response type:

- `"json"`: Schema describes the entire response (typically an array or object)
- `"jsonl"`: Schema describes each individual line/object

```json
{
  "response-type": "jsonl",
  "schema": {
    "type": "object",
    "properties": {
      "entity": { "type": "string" },
      "definition": { "type": "string" }
    },
    "required": ["entity", "definition"]
  }
}
```

This avoids changes to prompt configuration tooling and editors.
### JSONL Format Specification

#### Simple Extraction

For prompts extracting a single type of object (definitions, relationships, topics, rows), the output is one JSON object per line with no wrapper:

**Prompt output format:**
```
{"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"}
{"entity": "chlorophyll", "definition": "Green pigment in plants"}
{"entity": "mitochondria", "definition": "Powerhouse of the cell"}
```

**Contrast with previous JSON array format:**
```json
[
  {"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"},
  {"entity": "chlorophyll", "definition": "Green pigment in plants"},
  {"entity": "mitochondria", "definition": "Powerhouse of the cell"}
]
```

If the LLM truncates after line 2, the JSON array format yields invalid JSON, while JSONL yields two valid objects.
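The difference can be demonstrated directly; this self-contained example simulates a response cut off by a token limit:

```python
import json

# The same extraction, cut off mid-way through the second item
truncated_array = '[\n{"entity": "photosynthesis", "definition": "..."},\n{"entity": "chlor'
truncated_jsonl = '{"entity": "photosynthesis", "definition": "..."}\n{"entity": "chlor'

# JSON array format: the whole response is unparseable
try:
    json.loads(truncated_array)
    array_parsed = True
except json.JSONDecodeError:
    array_parsed = False

# JSONL format: only the incomplete final line is lost
recovered = []
for line in truncated_jsonl.splitlines():
    try:
        recovered.append(json.loads(line))
    except json.JSONDecodeError:
        pass  # skip the truncated line

# array_parsed is False; recovered holds the one complete object
```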
#### Mixed-Type Extraction (Discriminated Unions)

For prompts extracting multiple types of objects (e.g., both definitions and relationships, or entities, relationships, and attributes), use a `"type"` field as discriminator:

**Prompt output format:**
```
{"type": "definition", "entity": "DNA", "definition": "Molecule carrying genetic instructions"}
{"type": "relationship", "subject": "DNA", "predicate": "located_in", "object": "cell nucleus", "object-entity": true}
{"type": "definition", "entity": "RNA", "definition": "Molecule that carries genetic information"}
{"type": "relationship", "subject": "RNA", "predicate": "transcribed_from", "object": "DNA", "object-entity": true}
```

**Schema for discriminated unions uses `oneOf`:**
```json
{
  "response-type": "jsonl",
  "schema": {
    "oneOf": [
      {
        "type": "object",
        "properties": {
          "type": { "const": "definition" },
          "entity": { "type": "string" },
          "definition": { "type": "string" }
        },
        "required": ["type", "entity", "definition"]
      },
      {
        "type": "object",
        "properties": {
          "type": { "const": "relationship" },
          "subject": { "type": "string" },
          "predicate": { "type": "string" },
          "object": { "type": "string" },
          "object-entity": { "type": "boolean" }
        },
        "required": ["type", "subject", "predicate", "object", "object-entity"]
      }
    ]
  }
}
```
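Server-side, each line is validated against the `oneOf` schema; the discriminator routing can also be sketched without a schema library (a plain-dict sketch whose required-field sets mirror the schema above):

```python
REQUIRED = {
    "definition": {"type", "entity", "definition"},
    "relationship": {"type", "subject", "predicate", "object", "object-entity"},
}

def route(obj):
    """Classify a parsed JSONL object by its "type" discriminator."""
    kind = obj.get("type")
    if kind not in REQUIRED:
        raise ValueError(f"unknown type: {kind!r}")
    missing = REQUIRED[kind] - obj.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return kind
```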
#### Ontology Extraction

For ontology-based extraction with entities, relationships, and attributes:

**Prompt output format:**
```
{"type": "entity", "entity": "Cornish pasty", "entity_type": "fo/Recipe"}
{"type": "entity", "entity": "beef", "entity_type": "fo/Food"}
{"type": "relationship", "subject": "Cornish pasty", "subject_type": "fo/Recipe", "relation": "fo/has_ingredient", "object": "beef", "object_type": "fo/Food"}
{"type": "attribute", "entity": "Cornish pasty", "entity_type": "fo/Recipe", "attribute": "fo/serves", "value": "4 people"}
```

### Implementation Details

#### Prompt Class

The existing `Prompt` class requires no changes. The `schema` field is reused for JSONL, with its interpretation determined by `response_type`:

```python
class Prompt:
    def __init__(self, template, response_type="text", terms=None, schema=None):
        self.template = template
        self.response_type = response_type
        self.terms = terms
        self.schema = schema  # Interpretation depends on response_type
```

#### PromptManager.load_config

No changes required - existing configuration loading already handles the `schema` key.
#### JSONL Parsing

Add a new parsing method for JSONL responses:

```python
def parse_jsonl(self, text):
    """
    Parse a JSONL response, returning a list of valid objects.

    Invalid lines (malformed JSON, empty lines) are skipped with warnings.
    This provides truncation resilience: partial output yields partial results.
    """
    results = []

    for line_num, line in enumerate(text.strip().split('\n'), 1):
        line = line.strip()

        # Skip empty lines
        if not line:
            continue

        # Skip markdown code fence markers if present
        if line.startswith('```'):
            continue

        try:
            obj = json.loads(line)
            results.append(obj)
        except json.JSONDecodeError as e:
            # Log a warning but continue - this provides truncation resilience
            logger.warning(f"JSONL parse error on line {line_num}: {e}")

    return results
```
#### PromptManager.invoke Changes

Extend the invoke method to handle the new response type:

```python
async def invoke(self, id, input, llm):
    logger.debug("Invoking prompt template...")

    terms = self.terms | self.prompts[id].terms | input
    resp_type = self.prompts[id].response_type

    prompt = {
        "system": self.system_template.render(terms),
        "prompt": self.render(id, input)
    }

    resp = await llm(**prompt)

    if resp_type == "text":
        return resp

    if resp_type == "json":
        try:
            obj = self.parse_json(resp)
        except Exception:
            logger.error(f"JSON parse failed: {resp}")
            raise RuntimeError("JSON parse fail")

        if self.prompts[id].schema:
            try:
                validate(instance=obj, schema=self.prompts[id].schema)
                logger.debug("Schema validation successful")
            except Exception as e:
                raise RuntimeError(f"Schema validation fail: {e}")

        return obj

    if resp_type == "jsonl":
        objects = self.parse_jsonl(resp)

        if not objects:
            logger.warning("JSONL parse returned no valid objects")
            return []

        # Validate each object against the schema if provided
        if self.prompts[id].schema:
            validated = []
            for i, obj in enumerate(objects):
                try:
                    validate(instance=obj, schema=self.prompts[id].schema)
                    validated.append(obj)
                except Exception as e:
                    logger.warning(f"Object {i} failed schema validation: {e}")
            return validated

        return objects

    raise RuntimeError(f"Response type {resp_type} not known")
```
### Affected Prompts

The following prompts should be migrated to JSONL format:

| Prompt ID | Description | Type Field |
|-----------|-------------|------------|
| `extract-definitions` | Entity/definition extraction | No (single type) |
| `extract-relationships` | Relationship extraction | No (single type) |
| `extract-topics` | Topic/definition extraction | No (single type) |
| `extract-rows` | Structured row extraction | No (single type) |
| `agent-kg-extract` | Combined definition + relationship extraction | Yes: `"definition"`, `"relationship"` |
| `extract-with-ontologies` / `ontology-extract` | Ontology-based extraction | Yes: `"entity"`, `"relationship"`, `"attribute"` |
### API Changes

#### Client Perspective

JSONL parsing is transparent to prompt service API callers. The parsing occurs server-side in the prompt service, and the response is returned via the standard `PromptResponse.object` field as a serialized JSON array.

When clients call the prompt service (via `PromptClient.prompt()` or similar):

- **`response-type: "json"`** with array schema → client receives a Python `list`
- **`response-type: "jsonl"`** → client receives a Python `list`

From the client's perspective, both return identical data structures. The difference is entirely in how the LLM output is parsed server-side:

- JSON array format: Single `json.loads()` call; fails completely if truncated
- JSONL format: Line-by-line parsing; yields partial results if truncated

This means existing client code expecting a list from extraction prompts requires no changes when migrating prompts from JSON to JSONL format.

#### Server Return Value

For `response-type: "jsonl"`, the `PromptManager.invoke()` method returns a `list[dict]` containing all successfully parsed and validated objects. This list is then serialized to JSON for the `PromptResponse.object` field.

#### Error Handling

- Empty results: Returns an empty list `[]` with a warning log
- Partial parse failure: Returns the list of successfully parsed objects, with warning logs for failures
- Complete parse failure: Returns an empty list `[]` with warning logs

This differs from `response-type: "json"`, which raises `RuntimeError` on parse failure. The lenient behavior for JSONL is intentional, to provide truncation resilience.
### Configuration Example

Complete prompt configuration example:

```json
{
  "prompt": "Extract all entities and their definitions from the following text. Output one JSON object per line.\n\nText:\n{{text}}\n\nOutput format per line:\n{\"entity\": \"<name>\", \"definition\": \"<definition>\"}",
  "response-type": "jsonl",
  "schema": {
    "type": "object",
    "properties": {
      "entity": {
        "type": "string",
        "description": "The entity name"
      },
      "definition": {
        "type": "string",
        "description": "A clear definition of the entity"
      }
    },
    "required": ["entity", "definition"]
  }
}
```
## Security Considerations

- **Input Validation**: JSON parsing uses the standard `json.loads()`, which is safe against injection attacks
- **Schema Validation**: Uses `jsonschema.validate()` for schema enforcement
- **No New Attack Surface**: JSONL parsing is strictly safer than JSON array parsing due to line-by-line processing

## Performance Considerations

- **Memory**: Line-by-line parsing uses less peak memory than loading full JSON arrays
- **Latency**: Parsing performance is comparable to JSON array parsing
- **Validation**: Schema validation runs per object, which adds overhead but enables partial results on validation failure
## Testing Strategy

### Unit Tests

- JSONL parsing with valid input
- JSONL parsing with empty lines
- JSONL parsing with markdown code fences
- JSONL parsing with a truncated final line
- JSONL parsing with invalid JSON lines interspersed
- Schema validation with `oneOf` discriminated unions
- Backward compatibility: existing `"text"` and `"json"` prompts unchanged

### Integration Tests

- End-to-end extraction with JSONL prompts
- Extraction with simulated truncation (artificially limited response)
- Mixed-type extraction with type discriminator
- Ontology extraction with all three types

### Extraction Quality Tests

- Compare extraction results: JSONL vs JSON array format
- Verify truncation resilience: JSONL yields partial results where JSON fails
## Migration Plan

### Phase 1: Implementation

1. Implement the `parse_jsonl()` method in `PromptManager`
2. Extend `invoke()` to handle `response-type: "jsonl"`
3. Add unit tests

### Phase 2: Prompt Migration

1. Update `extract-definitions` prompt and configuration
2. Update `extract-relationships` prompt and configuration
3. Update `extract-topics` prompt and configuration
4. Update `extract-rows` prompt and configuration
5. Update `agent-kg-extract` prompt and configuration
6. Update `extract-with-ontologies` prompt and configuration

### Phase 3: Downstream Updates

1. Update any code consuming extraction results to handle the list return type
2. Update code that categorizes mixed-type extractions by `type` field
3. Update tests that assert on extraction output format

## Open Questions

None at this time.

## References

- Current implementation: `trustgraph-flow/trustgraph/template/prompt_manager.py`
- JSON Lines specification: https://jsonlines.org/
- JSON Schema `oneOf`: https://json-schema.org/understanding-json-schema/reference/combining.html#oneof
- Related specification: Streaming LLM Responses (`docs/tech-specs/streaming-llm-responses.md`)
613 docs/tech-specs/structured-data-2.md Normal file
# Structured Data Technical Specification (Part 2)

## Overview

This specification addresses issues and gaps identified during the initial implementation of TrustGraph's structured data integration, as described in `structured-data.md`.

## Problem Statements

### 1. Naming Inconsistency: "Object" vs "Row"

The current implementation uses "object" terminology throughout (e.g., `ExtractedObject`, object extraction, object embeddings). This naming is too generic and causes confusion:

- "Object" is an overloaded term in software (Python objects, JSON objects, etc.)
- The data being handled is fundamentally tabular: rows in tables with defined schemas
- "Row" more accurately describes the data model and aligns with database terminology

This inconsistency appears in module names, class names, message types, and documentation.
### 2. Row Store Query Limitations

The current row store implementation has significant query limitations:

**Natural Language Mismatch**: Queries struggle with real-world data variations. For example:
- A street database containing `"CHESTNUT ST"` is difficult to find when asking about `"Chestnut Street"`
- Abbreviations, case differences, and formatting variations break exact-match queries
- Users expect semantic understanding, but the store provides literal matching

**Schema Evolution Issues**: Changing schemas causes problems:
- Existing data may not conform to updated schemas
- Table structure changes can break queries and data integrity
- No clear migration path for schema updates

### 3. Row Embeddings Required

Related to problem 2, the system needs vector embeddings for row data to enable:

- Semantic search across structured data (finding "Chestnut Street" when the data contains "CHESTNUT ST")
- Similarity matching for fuzzy queries
- Hybrid search combining structured filters with semantic similarity
- Better natural language query support

The embedding service was specified but not implemented.

### 4. Row Data Ingestion Incomplete

The structured data ingestion pipeline is not fully operational:

- Diagnostic prompts exist to classify input formats (CSV, JSON, etc.)
- The ingestion service that uses these prompts is not plumbed into the system
- There is no end-to-end path for loading pre-structured data into the row store
## Goals

- **Schema Flexibility**: Enable schema evolution without breaking existing data or requiring migrations
- **Consistent Naming**: Standardize on "row" terminology throughout the codebase
- **Semantic Queryability**: Support fuzzy/semantic matching via row embeddings
- **Complete Ingestion Pipeline**: Provide an end-to-end path for loading structured data

## Technical Design

### Unified Row Storage Schema

The previous implementation created a separate Cassandra table for each schema. This caused problems when schemas evolved, as table structure changes required migrations.

The new design uses a single unified table for all row data:
```sql
CREATE TABLE rows (
    collection text,
    schema_name text,
    index_name text,
    index_value frozen<list<text>>,
    data map<text, text>,
    source text,
    PRIMARY KEY ((collection, schema_name, index_name), index_value)
)
```

#### Column Definitions

| Column | Type | Description |
|--------|------|-------------|
| `collection` | `text` | Data collection/import identifier (from metadata) |
| `schema_name` | `text` | Name of the schema this row conforms to |
| `index_name` | `text` | Name of the indexed field(s), comma-joined for composites |
| `index_value` | `frozen<list<text>>` | Index value(s) as a list |
| `data` | `map<text, text>` | Row data as key-value pairs |
| `source` | `text` | Optional URI linking to provenance information in the knowledge graph. Empty string or NULL indicates no source. |
#### Index Handling

Each row is stored multiple times, once per indexed field defined in the schema. The primary key fields are treated as an index with no special marker, providing future flexibility.

**Single-field index example:**
- Schema defines `email` as indexed
- `index_name = "email"`
- `index_value = ['foo@bar.com']`

**Composite index example:**
- Schema defines a composite index on `region` and `status`
- `index_name = "region,status"` (field names sorted and comma-joined)
- `index_value = ['US', 'active']` (values in the same order as the field names)

**Primary key example:**
- Schema defines `customer_id` as primary key
- `index_name = "customer_id"`
- `index_value = ['CUST001']`
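The three cases above follow one rule, which can be sketched as a small helper (the helper name is hypothetical; in the real writer the index definitions come from the schema config):

```python
def index_rows(row, indexes):
    """Yield one (index_name, index_value) pair per index defined in the schema.

    `indexes` is a list of field-name lists, e.g. [["email"], ["region", "status"]].
    Composite field names are sorted and comma-joined; values follow that order.
    """
    for fields in indexes:
        ordered = sorted(fields)
        yield ",".join(ordered), [str(row[f]) for f in ordered]
```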
#### Query Patterns

All queries follow the same pattern regardless of which index is used:

```sql
SELECT * FROM rows
WHERE collection = 'import_2024'
  AND schema_name = 'customers'
  AND index_name = 'email'
  AND index_value = ['foo@bar.com']
```

#### Design Trade-offs

**Advantages:**
- Schema changes don't require table structure changes
- Row data is opaque to Cassandra; field additions/removals are transparent
- Consistent query pattern for all access methods
- No Cassandra secondary indexes (which can be slow at scale)
- Native Cassandra types throughout (`map`, `frozen<list>`)

**Trade-offs:**
- Write amplification: each row insert = N inserts (one per indexed field)
- Storage overhead from duplicated row data
- Type information is stored in the schema config, with conversion at the application layer

#### Consistency Model

The design accepts certain simplifications:

1. **No row updates**: The system is append-only. This eliminates consistency concerns about updating multiple copies of the same row.

2. **Schema change tolerance**: When schemas change (e.g., indexes added/removed), existing rows retain their original indexing. Old rows won't be discoverable via new indexes. Users can delete and recreate a schema to ensure consistency if needed.
### Partition Tracking and Deletion

#### The Problem

With the partition key `(collection, schema_name, index_name)`, efficient deletion requires knowing all partition keys to delete. Deleting by just `collection` or `collection + schema_name` requires knowing all the `index_name` values that have data.

#### Partition Tracking Table

A secondary lookup table tracks which partitions exist:

```sql
CREATE TABLE row_partitions (
    collection text,
    schema_name text,
    index_name text,
    PRIMARY KEY ((collection), schema_name, index_name)
)
```

This enables efficient discovery of partitions for deletion operations.
#### Row Writer Behavior

The row writer maintains an in-memory cache of registered `(collection, schema_name)` pairs. When processing a row:

1. Check if `(collection, schema_name)` is in the cache
2. If not cached (first row for this pair):
   - Look up the schema config to get all index names
   - Insert entries into `row_partitions` for each `(collection, schema_name, index_name)`
   - Add the pair to the cache
3. Proceed with writing the row data

The row writer also monitors schema config change events. When a schema changes, the relevant cache entries are cleared so the next row triggers re-registration with the updated index names.

This approach ensures:
- Lookup table writes happen once per `(collection, schema_name)` pair, not per row
- The lookup table reflects the indexes that were active when the data was written
- Schema changes mid-import are picked up correctly
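A sketch of the caching behaviour described above, with the schema lookup and the `row_partitions` write injected as callables (class and parameter names are illustrative, not the actual implementation):

```python
class PartitionRegistry:
    """Sketch of the row writer's once-per-(collection, schema) registration."""

    def __init__(self, get_index_names, register_partition):
        self.get_index_names = get_index_names        # schema_name -> [index_name]
        self.register_partition = register_partition  # writes one row_partitions entry
        self.seen = set()

    def ensure_registered(self, collection, schema_name):
        key = (collection, schema_name)
        if key in self.seen:
            return
        for index_name in self.get_index_names(schema_name):
            self.register_partition(collection, schema_name, index_name)
        self.seen.add(key)

    def on_schema_changed(self, schema_name):
        # Clear cache entries so the next row re-registers with updated indexes
        self.seen = {k for k in self.seen if k[1] != schema_name}
```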
#### Deletion Operations

**Delete collection:**
```sql
-- 1. Discover all partitions
SELECT schema_name, index_name FROM row_partitions WHERE collection = 'X';

-- 2. Delete each partition from the rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = '...' AND index_name = '...';
-- (repeat for each discovered partition)

-- 3. Clean up the lookup table
DELETE FROM row_partitions WHERE collection = 'X';
```

**Delete collection + schema:**
```sql
-- 1. Discover partitions for this schema
SELECT index_name FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';

-- 2. Delete each partition from the rows table
DELETE FROM rows WHERE collection = 'X' AND schema_name = 'Y' AND index_name = '...';
-- (repeat for each discovered partition)

-- 3. Clean up the lookup table entries
DELETE FROM row_partitions WHERE collection = 'X' AND schema_name = 'Y';
```
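The delete-collection sequence can be sketched with the statement runner injected (a sketch only; in practice this would use a Cassandra driver session with prepared statements):

```python
def delete_collection(execute, collection):
    """Delete a collection via the partition tracking table.

    `execute` is assumed to run a CQL statement with bound parameters and,
    for the SELECT, return rows as (schema_name, index_name) tuples.
    """
    partitions = execute(
        "SELECT schema_name, index_name FROM row_partitions WHERE collection = %s",
        (collection,),
    )
    for schema_name, index_name in partitions:
        execute(
            "DELETE FROM rows WHERE collection = %s AND schema_name = %s "
            "AND index_name = %s",
            (collection, schema_name, index_name),
        )
    execute("DELETE FROM row_partitions WHERE collection = %s", (collection,))
```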
### Row Embeddings

Row embeddings enable semantic/fuzzy matching on indexed values, solving the natural language mismatch problem (e.g., finding "CHESTNUT ST" when querying for "Chestnut Street").

#### Design Overview

Each indexed value is embedded and stored in a vector store (Qdrant). At query time, the query is embedded, similar vectors are found, and the associated metadata is used to look up the actual rows in Cassandra.

#### Qdrant Collection Structure

One Qdrant collection per `(user, collection, schema_name, dimension)` tuple:

- **Collection naming:** `rows_{user}_{collection}_{schema_name}_{dimension}`
- Names are sanitized (non-alphanumeric characters replaced with `_`, lowercased, numeric prefixes get an `r_` prefix)
- **Rationale:** Enables clean deletion of a `(user, collection, schema_name)` instance by dropping matching Qdrant collections; the dimension suffix allows different embedding models to coexist
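The naming rules above can be sketched as a small helper. This is an illustrative interpretation of the stated sanitization rules, not the project's actual implementation; function names are assumptions:

```python
import re

def sanitize(part: str) -> str:
    """Lowercase, replace non-alphanumeric characters with '_',
    and add an 'r_' prefix if the result starts with a digit."""
    s = re.sub(r"[^a-z0-9]", "_", part.lower())
    return "r_" + s if s and s[0].isdigit() else s

def qdrant_collection_name(user: str, collection: str,
                           schema_name: str, dimension: int) -> str:
    """Build the per-(user, collection, schema_name, dimension) name."""
    u, c, s = (sanitize(p) for p in (user, collection, schema_name))
    return f"rows_{u}_{c}_{s}_{dimension}"
```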
#### What Gets Embedded

The text representation of index values:

| Index Type | Example `index_value` | Text to Embed |
|------------|----------------------|---------------|
| Single-field | `['foo@bar.com']` | `"foo@bar.com"` |
| Composite | `['US', 'active']` | `"US active"` (space-joined) |
#### Point Structure

Each Qdrant point contains:

```json
{
  "id": "<uuid>",
  "vector": [0.1, 0.2, ...],
  "payload": {
    "index_name": "street_name",
    "index_value": ["CHESTNUT ST"],
    "text": "CHESTNUT ST"
  }
}
```

| Payload Field | Description |
|---------------|-------------|
| `index_name` | The indexed field(s) this embedding represents |
| `index_value` | The original list of values (for Cassandra lookup) |
| `text` | The text that was embedded (for debugging/display) |

Note: `user`, `collection`, and `schema_name` are implicit from the Qdrant collection name.
#### Query Flow

1. User queries for "Chestnut Street" within user U, collection X, schema Y
2. Embed the query text
3. Determine the Qdrant collection name(s) matching prefix `rows_U_X_Y_`
4. Search the matching Qdrant collection(s) for nearest vectors
5. Get matching points with payloads containing `index_name` and `index_value`
6. Query Cassandra:

   ```sql
   SELECT * FROM rows
   WHERE collection = 'X'
     AND schema_name = 'Y'
     AND index_name = '<from payload>'
     AND index_value = <from payload>
   ```

7. Return matched rows
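Step 6 can be parameterised directly from a matched point's payload. A minimal sketch, with an illustrative helper name, mirroring the query pattern above:

```python
def build_row_lookup(collection: str, schema_name: str, payload: dict):
    """Build the CQL statement and bind values for the exact-match
    row lookup driven by a Qdrant point's payload."""
    cql = (
        "SELECT * FROM rows "
        "WHERE collection = %s AND schema_name = %s "
        "AND index_name = %s AND index_value = %s"
    )
    return cql, (collection, schema_name,
                 payload["index_name"], payload["index_value"])
```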
#### Optional: Filtering by Index Name

Queries can optionally filter by `index_name` in Qdrant to search only specific fields:

- **"Find any field matching 'Chestnut'"** → search all vectors in the collection
- **"Find street_name matching 'Chestnut'"** → filter where `payload.index_name = 'street_name'`
#### Architecture

Row embeddings follow the **two-stage pattern** used by GraphRAG (graph-embeddings, document-embeddings):

- **Stage 1: Embedding computation** (`trustgraph-flow/trustgraph/embeddings/row_embeddings/`) - Consumes `ExtractedObject`, computes embeddings via the embeddings service, outputs `RowEmbeddings`
- **Stage 2: Embedding storage** (`trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/`) - Consumes `RowEmbeddings`, writes vectors to Qdrant

The Cassandra row writer is a separate parallel consumer:

- **Cassandra row writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra`) - Consumes `ExtractedObject`, writes rows to Cassandra

All three services consume from the same flow, keeping them decoupled. This allows:

- Independent scaling of Cassandra writes vs embedding generation vs vector storage
- Embedding services can be disabled if not needed
- Failures in one service don't affect the others
- Consistent architecture with GraphRAG pipelines
#### Write Path

**Stage 1 (row-embeddings processor):** When receiving an `ExtractedObject`:

1. Look up the schema to find indexed fields
2. For each indexed field:
   - Build the text representation of the index value
   - Compute the embedding via the embeddings service
3. Output a `RowEmbeddings` message containing all computed vectors
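Step 2's text construction follows the earlier table: single values pass through and composites are space-joined. A sketch under that assumption; the `_`-joined composite `index_name` convention and the helper name are illustrative:

```python
def build_index_texts(values: dict[str, str],
                      indexes: list[list[str]]) -> list[tuple[str, list[str], str]]:
    """For each index (a list of one or more field names), gather the
    field values and join them into the text to embed.
    Returns (index_name, index_value, text) triples."""
    out = []
    for fields in indexes:
        index_value = [values[f] for f in fields]
        out.append(("_".join(fields), index_value, " ".join(index_value)))
    return out
```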
**Stage 2 (row-embeddings-write-qdrant):** When receiving a `RowEmbeddings`:

1. For each embedding in the message:
   - Determine the Qdrant collection from `(user, collection, schema_name, dimension)`
   - Create the collection if needed (lazy creation on first write)
   - Upsert the point with vector and payload
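The upserted point can be assembled in the REST payload shape matching the point structure documented above (the helper name is illustrative):

```python
import uuid

def make_point(vector: list[float], index_name: str,
               index_value: list[str], text: str) -> dict:
    """Assemble one Qdrant point for an embedding: a fresh UUID id,
    the vector, and the payload used for the Cassandra lookup."""
    return {
        "id": str(uuid.uuid4()),
        "vector": vector,
        "payload": {
            "index_name": index_name,
            "index_value": index_value,
            "text": text,
        },
    }
```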
#### Message Types

```python
@dataclass
class RowIndexEmbedding:
    index_name: str             # The indexed field name(s)
    index_value: list[str]      # The field value(s)
    text: str                   # Text that was embedded
    vectors: list[list[float]]  # Computed embedding vectors


@dataclass
class RowEmbeddings:
    metadata: Metadata
    schema_name: str
    embeddings: list[RowIndexEmbedding]
```
#### Deletion Integration

Qdrant collections are discovered by prefix matching on the collection name pattern:

**Delete `(user, collection)`:**

1. List all Qdrant collections matching prefix `rows_{user}_{collection}_`
2. Delete each matching collection
3. Delete Cassandra rows partitions (as documented above)
4. Clean up `row_partitions` entries

**Delete `(user, collection, schema_name)`:**

1. List all Qdrant collections matching prefix `rows_{user}_{collection}_{schema_name}_`
2. Delete each matching collection (handles multiple dimensions)
3. Delete Cassandra rows partitions
4. Clean up `row_partitions`
#### Module Locations

| Stage | Module | Entry Point |
|-------|--------|-------------|
| Stage 1 | `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | `row-embeddings` |
| Stage 2 | `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | `row-embeddings-write-qdrant` |
### Row Embeddings Query API

The row embeddings query is a **separate API** from the GraphQL row query service:

| API | Purpose | Backend |
|-----|---------|---------|
| Row Query (GraphQL) | Exact matching on indexed fields | Cassandra |
| Row Embeddings Query | Fuzzy/semantic matching | Qdrant |

This separation keeps concerns clean:

- The GraphQL service focuses on exact, structured queries
- The embeddings API handles semantic similarity
- User workflow: fuzzy search via embeddings to find candidates, then an exact query to get full row data
#### Request/Response Schema

```python
from dataclasses import dataclass, field

@dataclass
class RowEmbeddingsRequest:
    vectors: list[list[float]]  # Query vectors (pre-computed embeddings)
    user: str = ""
    collection: str = ""
    schema_name: str = ""
    index_name: str = ""        # Optional: filter to a specific index
    limit: int = 10             # Max results per vector


@dataclass
class RowIndexMatch:
    index_name: str = ""        # The matched index field(s)
    index_value: list[str] = field(default_factory=list)  # The matched value(s)
    text: str = ""              # Original text that was embedded
    score: float = 0.0          # Similarity score


@dataclass
class RowEmbeddingsResponse:
    error: Error | None = None
    matches: list[RowIndexMatch] = field(default_factory=list)
```
#### Query Processor

Module: `trustgraph-flow/trustgraph/query/row_embeddings/qdrant`

Entry point: `row-embeddings-query-qdrant`

The processor:

1. Receives a `RowEmbeddingsRequest` with query vectors
2. Finds the appropriate Qdrant collection by prefix matching
3. Searches for nearest vectors with an optional `index_name` filter
4. Returns a `RowEmbeddingsResponse` with matching index information
#### API Gateway Integration

The gateway exposes row embeddings queries via the standard request/response pattern:

| Component | Location |
|-----------|----------|
| Dispatcher | `trustgraph-flow/trustgraph/gateway/dispatch/row_embeddings_query.py` |
| Registration | Add `"row-embeddings"` to `request_response_dispatchers` in `manager.py` |

Flow interface name: `row-embeddings`

Interface definition in the flow blueprint:

```json
{
  "interfaces": {
    "row-embeddings": {
      "request": "non-persistent://tg/request/row-embeddings:{id}",
      "response": "non-persistent://tg/response/row-embeddings:{id}"
    }
  }
}
```
#### Python SDK Support

The SDK provides methods for row embeddings queries:

```python
# Flow-scoped query (preferred)
api = Api(url)
flow = api.flow().id("default")

# Query with text (SDK computes embeddings)
matches = flow.row_embeddings_query(
    text="Chestnut Street",
    collection="my_collection",
    schema_name="addresses",
    index_name="street_name",  # Optional filter
    limit=10
)

# Query with pre-computed vectors
matches = flow.row_embeddings_query(
    vectors=[[0.1, 0.2, ...]],
    collection="my_collection",
    schema_name="addresses"
)

# Each match contains:
for match in matches:
    print(match.index_name)   # e.g., "street_name"
    print(match.index_value)  # e.g., ["CHESTNUT ST"]
    print(match.text)         # e.g., "CHESTNUT ST"
    print(match.score)        # e.g., 0.95
```
#### CLI Utility

Command: `tg-invoke-row-embeddings`

```bash
# Query by text (computes embedding automatically)
tg-invoke-row-embeddings \
    --text "Chestnut Street" \
    --collection my_collection \
    --schema addresses \
    --index street_name \
    --limit 10

# Query by vector file
tg-invoke-row-embeddings \
    --vectors vectors.json \
    --collection my_collection \
    --schema addresses

# Output formats
tg-invoke-row-embeddings --text "..." --format json
tg-invoke-row-embeddings --text "..." --format table
```
#### Typical Usage Pattern

The row embeddings query is typically used as part of a fuzzy-to-exact lookup flow:

```python
# Step 1: Fuzzy search via embeddings
matches = flow.row_embeddings_query(
    text="chestnut street",
    collection="geo",
    schema_name="streets"
)

# Step 2: Exact lookup via GraphQL for full row data
for match in matches:
    query = f'''
    query {{
      streets(where: {{ {match.index_name}: {{ eq: "{match.index_value[0]}" }} }}) {{
        street_name
        city
        zip_code
      }}
    }}
    '''
    rows = flow.rows_query(query, collection="geo")
```

This two-step pattern enables:

- Finding "CHESTNUT ST" when the user searches for "Chestnut Street"
- Retrieving complete row data with all fields
- Combining semantic similarity with structured data access
### Row Data Ingestion

Deferred to a subsequent phase. It will be designed alongside other ingestion changes.
## Implementation Impact

### Current State Analysis

The existing implementation has two main components:

| Component | Location | Lines | Description |
|-----------|----------|-------|-------------|
| Query Service | `trustgraph-flow/trustgraph/query/objects/cassandra/service.py` | ~740 | Monolithic: GraphQL schema generation, filter parsing, Cassandra queries, request handling |
| Writer | `trustgraph-flow/trustgraph/storage/objects/cassandra/write.py` | ~540 | Per-schema table creation, secondary indexes, insert/delete |

**Current Query Pattern:**

```sql
SELECT * FROM {keyspace}.o_{schema_name}
WHERE collection = 'X' AND email = 'foo@bar.com'
ALLOW FILTERING
```

**New Query Pattern:**

```sql
SELECT * FROM {keyspace}.rows
WHERE collection = 'X' AND schema_name = 'customers'
AND index_name = 'email' AND index_value = ['foo@bar.com']
```
### Key Changes

1. **Query semantics simplify**: The new schema only supports exact matches on `index_value`. The current GraphQL filters (`gt`, `lt`, `contains`, etc.) either:
   - Become post-filtering on returned data (if still needed)
   - Are removed in favor of using the embeddings API for fuzzy matching

2. **GraphQL code is tightly coupled**: The current `service.py` bundles Strawberry type generation, filter parsing, and Cassandra-specific queries. Adding another row store backend would duplicate ~400 lines of GraphQL code.
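If post-filtering is retained, it can be a thin layer over the exact-match results. A hedged sketch; the operator set and helper name are illustrative, not the existing filter implementation:

```python
OPS = {
    "eq": lambda v, a: v == a,
    "gt": lambda v, a: v is not None and v > a,
    "lt": lambda v, a: v is not None and v < a,
    "contains": lambda v, a: v is not None and a in v,
}

def post_filter(rows: list[dict], field: str, op: str, arg) -> list[dict]:
    """Apply a GraphQL-style filter in application code, since the
    unified table only supports equality on index_value."""
    return [r for r in rows if OPS[op](r.get(field), arg)]
```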
### Proposed Refactor

The refactor has two parts:

#### 1. Break Out GraphQL Code

Extract reusable GraphQL components into a shared module:

```
trustgraph-flow/trustgraph/query/graphql/
├── __init__.py
├── types.py      # Filter types (IntFilter, StringFilter, FloatFilter)
├── schema.py     # Dynamic schema generation from RowSchema
└── filters.py    # Filter parsing utilities
```

This enables:

- Reuse across different row store backends
- Cleaner separation of concerns
- Easier testing of GraphQL logic independently
#### 2. Implement New Table Schema

Refactor the Cassandra-specific code to use the unified table:

**Writer** (`trustgraph-flow/trustgraph/storage/rows/cassandra/`):

- Single `rows` table instead of per-schema tables
- Write N copies per row (one per index)
- Register partitions in the `row_partitions` table
- Simpler table creation (one-time setup)

**Query Service** (`trustgraph-flow/trustgraph/query/rows/cassandra/`):

- Query the unified `rows` table
- Use the extracted GraphQL module for schema generation
- Simplified filter handling (exact match only at the DB level)
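Writing N copies per row might look like the following sketch. The column layout of `rows` is assumed from the query pattern shown earlier, and the helper name is illustrative:

```python
def row_insert_statements(collection: str, schema_name: str,
                          indexes: dict[str, list[str]],
                          values: dict[str, str]):
    """Yield one (cql, params) pair per index, so each row is written
    N times - once under every index it participates in."""
    cql = (
        "INSERT INTO rows "
        "(collection, schema_name, index_name, index_value, values) "
        "VALUES (%s, %s, %s, %s, %s)"
    )
    for index_name, fields in indexes.items():
        index_value = [values[f] for f in fields]
        yield cql, (collection, schema_name, index_name, index_value, values)
```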
### Module Renames

As part of the "object" → "row" naming cleanup:

| Current | New |
|---------|-----|
| `storage/objects/cassandra/` | `storage/rows/cassandra/` |
| `query/objects/cassandra/` | `query/rows/cassandra/` |
| `embeddings/object_embeddings/` | `embeddings/row_embeddings/` |
### New Modules

| Module | Purpose |
|--------|---------|
| `trustgraph-flow/trustgraph/query/graphql/` | Shared GraphQL utilities |
| `trustgraph-flow/trustgraph/query/row_embeddings/qdrant/` | Row embeddings query API |
| `trustgraph-flow/trustgraph/embeddings/row_embeddings/` | Row embeddings computation (Stage 1) |
| `trustgraph-flow/trustgraph/storage/row_embeddings/qdrant/` | Row embeddings storage (Stage 2) |
## References

- [Structured Data Technical Specification](structured-data.md)