mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 16:36:21 +02:00
Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.
579 lines
18 KiB
Markdown
579 lines
18 KiB
Markdown
---
|
|
layout: default
|
|
title: "Graph Contexts Technical Specification"
|
|
parent: "Tech Specs"
|
|
---
|
|
|
|
# Graph Contexts Technical Specification
|
|
|
|
## Overview
|
|
|
|
This specification describes changes to TrustGraph's core graph primitives to
|
|
align with RDF 1.2 and support full RDF Dataset semantics. This is a breaking
|
|
change for the 2.x release series.
|
|
|
|
### Versioning
|
|
|
|
- **2.0**: Early adopter release. Core features available, may not be fully
|
|
production-ready.
|
|
- **2.1 / 2.2**: Production release. Stability and completeness validated.
|
|
|
|
Flexibility on maturity is intentional - early adopters can access new
|
|
capabilities before all features are production-hardened.
|
|
|
|
## Goals
|
|
|
|
The primary goals for this work are to enable metadata about facts/statements:
|
|
|
|
- **Temporal information**: Associate facts with time metadata
|
|
- When a fact was believed to be true
|
|
- When a fact became true
|
|
- When a fact was discovered to be false
|
|
|
|
- **Provenance/Sources**: Track which sources support a fact
|
|
- "This fact was supported by source X"
|
|
- Link facts back to their origin documents
|
|
|
|
- **Veracity/Trust**: Record assertions about truth
|
|
- "Person P asserted this was true"
|
|
- "Person Q claims this is false"
|
|
- Enable trust scoring and conflict detection
|
|
|
|
**Hypothesis**: Reification (RDF-star / quoted triples) is the key mechanism
|
|
to achieve these outcomes, as all require making statements about statements.
|
|
|
|
## Background
|
|
|
|
To express "the fact (Alice knows Bob) was discovered on 2024-01-15" or
|
|
"source X supports the claim (Y causes Z)", you need to reference an edge
|
|
as a thing you can make statements about. Standard triples don't support this.
|
|
|
|
### Current Limitations
|
|
|
|
The current `Value` class in `trustgraph-base/trustgraph/schema/core/primitives.py`
|
|
can represent:
|
|
- URI nodes (`is_uri=True`)
|
|
- Literal values (`is_uri=False`)
|
|
|
|
The `type` field exists but is not used to represent XSD datatypes.
|
|
|
|
## Technical Design
|
|
|
|
### RDF Features to Support
|
|
|
|
#### Core Features (Related to Reification Goals)
|
|
|
|
These features are directly related to the temporal, provenance, and veracity
|
|
goals:
|
|
|
|
1. **RDF 1.2 Quoted Triples (RDF-star)**
|
|
- Edges that point at other edges
|
|
- A Triple can appear as the subject or object of another Triple
|
|
- Enables statements about statements (reification)
|
|
- Core mechanism for annotating individual facts
|
|
|
|
2. **RDF Dataset / Named Graphs**
|
|
- Support for multiple named graphs within a dataset
|
|
- Each graph identified by an IRI
|
|
- Moves from triples (s, p, o) to quads (s, p, o, g)
|
|
- Includes a default graph plus zero or more named graphs
|
|
- The graph IRI can be a subject in statements, e.g.:
|
|
```
|
|
<graph-source-A> <discoveredOn> "2024-01-15"
|
|
<graph-source-A> <hasVeracity> "high"
|
|
```
|
|
- Note: Named graphs are a separate feature from reification. They have
|
|
uses beyond statement annotation (partitioning, access control, dataset
|
|
organization) and should be treated as a distinct capability.
|
|
|
|
3. **Blank Nodes** (Limited Support)
|
|
- Anonymous nodes without a global URI
|
|
- Supported for compatibility when loading external RDF data
|
|
- **Limited status**: No guarantees about stable identity after loading
|
|
- Find them via wildcard queries (match by connections, not by ID)
|
|
- Not a first-class feature - don't rely on precise blank node handling
|
|
|
|
#### Opportunistic Fixes (2.0 Breaking Change)
|
|
|
|
These features are not directly related to the reification goals but are
|
|
valuable improvements to include while making breaking changes:
|
|
|
|
4. **Literal Datatypes**
|
|
- Properly use the `type` field for XSD datatypes
|
|
- Examples: xsd:string, xsd:integer, xsd:dateTime, etc.
|
|
- Fixes current limitation: cannot represent dates or integers properly
|
|
|
|
5. **Language Tags**
|
|
- Support for language attributes on string literals (@en, @fr, etc.)
|
|
- Note: A literal has either a language tag OR a datatype, not both
|
|
(except for rdf:langString)
|
|
- Important for AI/multilingual use cases
|
|
|
|
### Data Models
|
|
|
|
#### Term (rename from Value)
|
|
|
|
The `Value` class will be renamed to `Term` to better reflect RDF terminology.
|
|
This rename serves two purposes:
|
|
1. Aligns naming with RDF concepts (a "Term" can be an IRI, literal, blank
|
|
node, or quoted triple - not just a "value")
|
|
2. Forces code review at the breaking change interface - any code still
|
|
referencing `Value` is visibly broken and needs updating
|
|
|
|
A Term can represent:
|
|
|
|
- **IRI/URI** - A named node/resource
|
|
- **Blank Node** - An anonymous node with local scope
|
|
- **Literal** - A data value with either:
|
|
- A datatype (XSD type), OR
|
|
- A language tag
|
|
- **Quoted Triple** - A triple used as a term (RDF 1.2)
|
|
|
|
##### Chosen Approach: Single Class with Type Discriminator
|
|
|
|
Serialization requirements drive the structure - a type discriminator is needed
|
|
in the wire format regardless of the Python representation. A single class with
|
|
a type field is the natural fit and aligns with the current `Value` pattern.
|
|
|
|
Single-character type codes provide compact serialization:
|
|
|
|
```python
|
|
from dataclasses import dataclass
|
|
|
|
# Term type constants
|
|
IRI = "i" # IRI/URI node
|
|
BLANK = "b" # Blank node
|
|
LITERAL = "l" # Literal value
|
|
TRIPLE = "t" # Quoted triple (RDF-star)
|
|
|
|
@dataclass
|
|
class Term:
|
|
type: str = "" # One of: IRI, BLANK, LITERAL, TRIPLE
|
|
|
|
# For IRI terms (type == IRI)
|
|
iri: str = ""
|
|
|
|
# For blank nodes (type == BLANK)
|
|
id: str = ""
|
|
|
|
# For literals (type == LITERAL)
|
|
value: str = ""
|
|
datatype: str = "" # XSD datatype URI (mutually exclusive with language)
|
|
language: str = "" # Language tag (mutually exclusive with datatype)
|
|
|
|
# For quoted triples (type == TRIPLE)
|
|
triple: "Triple | None" = None
|
|
```
|
|
|
|
Usage examples:
|
|
|
|
```python
|
|
# IRI term
|
|
node = Term(type=IRI, iri="http://example.org/Alice")
|
|
|
|
# Literal with datatype
|
|
age = Term(type=LITERAL, value="42", datatype="xsd:integer")
|
|
|
|
# Literal with language tag
|
|
label = Term(type=LITERAL, value="Hello", language="en")
|
|
|
|
# Blank node
|
|
anon = Term(type=BLANK, id="_:b1")
|
|
|
|
# Quoted triple (statement about a statement)
|
|
inner = Triple(
|
|
s=Term(type=IRI, iri="http://example.org/Alice"),
|
|
p=Term(type=IRI, iri="http://example.org/knows"),
|
|
o=Term(type=IRI, iri="http://example.org/Bob"),
|
|
)
|
|
reified = Term(type=TRIPLE, triple=inner)
|
|
```
|
|
|
|
##### Alternatives Considered
|
|
|
|
**Option B: Union of specialized classes** (`Term = IRI | BlankNode | Literal | QuotedTriple`)
|
|
- Rejected: Serialization would still need a type discriminator, adding complexity
|
|
|
|
**Option C: Base class with subclasses**
|
|
- Rejected: Same serialization issue, plus dataclass inheritance quirks
|
|
|
|
#### Triple / Quad
|
|
|
|
The `Triple` class gains an optional graph field to become a quad:
|
|
|
|
```python
|
|
@dataclass
|
|
class Triple:
|
|
s: Term | None = None # Subject
|
|
p: Term | None = None # Predicate
|
|
o: Term | None = None # Object
|
|
g: str | None = None # Graph name (IRI), None = default graph
|
|
```
|
|
|
|
Design decisions:
|
|
- **Field name**: `g` for consistency with `s`, `p`, `o`
|
|
- **Optional**: `None` means the default graph (unnamed)
|
|
- **Type**: Plain string (IRI) rather than Term
|
|
- Graph names are always IRIs
|
|
- Blank nodes as graph names ruled out (too confusing)
|
|
- No need for the full Term machinery
|
|
|
|
Note: The class name stays `Triple` even though it's technically a quad now.
|
|
This avoids churn and "triple" is still the common terminology for the s/p/o
|
|
portion. The graph context is metadata about where the triple lives.
|
|
|
|
### Candidate Query Patterns
|
|
|
|
The current query engine accepts combinations of S, P, O terms. With quoted
|
|
triples, a triple itself becomes a valid term in those positions. Below are
|
|
candidate query patterns that support the original goals.
|
|
|
|
#### Graph Parameter Semantics
|
|
|
|
Following SPARQL conventions for backward compatibility:
|
|
|
|
- **`g` omitted / None**: Query the default graph only
|
|
- **`g` = specific IRI**: Query that named graph only
|
|
- **`g` = wildcard / `*`**: Query across all graphs (equivalent to SPARQL
|
|
`GRAPH ?g { ... }`)
|
|
|
|
This keeps simple queries simple and makes named graph queries opt-in.
|
|
|
|
Cross-graph queries (g=wildcard) are fully supported. The Cassandra schema
|
|
includes dedicated tables (SPOG, POSG, OSPG) where g is a clustering column
|
|
rather than a partition key, enabling efficient queries across all graphs.
|
|
|
|
#### Temporal Queries
|
|
|
|
**Find all facts discovered after a given date:**
|
|
```
|
|
S: ? # any quoted triple
|
|
P: <discoveredOn>
|
|
O: > "2024-01-15"^^xsd:date # date comparison
|
|
```
|
|
|
|
**Find when a specific fact was believed true:**
|
|
```
|
|
S: << <Alice> <knows> <Bob> >> # quoted triple as subject
|
|
P: <believedTrueFrom>
|
|
O: ? # returns the date
|
|
```
|
|
|
|
**Find facts that became false:**
|
|
```
|
|
S: ? # any quoted triple
|
|
P: <discoveredFalseOn>
|
|
O: ? # has any value (exists)
|
|
```
|
|
|
|
#### Provenance Queries
|
|
|
|
**Find all facts supported by a specific source:**
|
|
```
|
|
S: ? # any quoted triple
|
|
P: <supportedBy>
|
|
O: <source:document-123>
|
|
```
|
|
|
|
**Find which sources support a specific fact:**
|
|
```
|
|
S: << <DrugA> <treats> <DiseaseB> >> # quoted triple as subject
|
|
P: <supportedBy>
|
|
O: ? # returns source IRIs
|
|
```
|
|
|
|
#### Veracity Queries
|
|
|
|
**Find assertions a person marked as true:**
|
|
```
|
|
S: ? # any quoted triple
|
|
P: <assertedTrueBy>
|
|
O: <person:Alice>
|
|
```
|
|
|
|
**Find conflicting assertions (same fact, different veracity):**
|
|
```
|
|
# First query: facts asserted true
|
|
S: ?
|
|
P: <assertedTrueBy>
|
|
O: ?
|
|
|
|
# Second query: facts asserted false
|
|
S: ?
|
|
P: <assertedFalseBy>
|
|
O: ?
|
|
|
|
# Application logic: find intersection of subjects
|
|
```
|
|
|
|
**Find facts with trust score below threshold:**
|
|
```
|
|
S: ? # any quoted triple
|
|
P: <trustScore>
|
|
O: < 0.5 # numeric comparison
|
|
```
|
|
|
|
### Architecture
|
|
|
|
Significant changes required across multiple components:
|
|
|
|
#### This Repository (trustgraph)
|
|
|
|
- **Schema primitives** (`trustgraph-base/trustgraph/schema/core/primitives.py`)
|
|
- Value → Term rename
|
|
- New Term structure with type discriminator
|
|
- Triple gains `g` field for graph context
|
|
|
|
- **Message translators** (`trustgraph-base/trustgraph/messaging/translators/`)
|
|
- Update for new Term/Triple structures
|
|
- Serialization/deserialization for new fields
|
|
|
|
- **Gateway components**
|
|
- Handle new Term and quad structures
|
|
|
|
- **Knowledge cores**
|
|
- Core changes to support quads and reification
|
|
|
|
- **Knowledge manager**
|
|
- Schema changes propagate here
|
|
|
|
- **Storage layers**
|
|
- Cassandra: Schema redesign (see Implementation Details)
|
|
- Other backends: Deferred to later phases
|
|
|
|
- **Command-line utilities**
|
|
- Update for new data structures
|
|
|
|
- **REST API documentation**
|
|
- OpenAPI spec updates
|
|
|
|
#### External Repositories
|
|
|
|
- **Python API** (this repo)
|
|
- Client library updates for new structures
|
|
|
|
- **TypeScript APIs** (separate repo)
|
|
- Client library updates
|
|
|
|
- **Workbench** (separate repo)
|
|
- Significant state management changes
|
|
|
|
### APIs
|
|
|
|
#### REST API
|
|
|
|
- Documented in OpenAPI spec
|
|
- Will need updates for new Term/Triple structures
|
|
- New endpoints may be needed for graph context operations
|
|
|
|
#### Python API (this repo)
|
|
|
|
- Client library changes to match new primitives
|
|
- Breaking changes to Term (was Value) and Triple
|
|
|
|
#### TypeScript API (separate repo)
|
|
|
|
- Parallel changes to Python API
|
|
- Separate release coordination
|
|
|
|
#### Workbench (separate repo)
|
|
|
|
- Significant state management changes
|
|
- UI updates for graph context features
|
|
|
|
### Implementation Details
|
|
|
|
#### Phased Storage Implementation
|
|
|
|
Multiple graph store backends exist (Cassandra, Neo4j, etc.). Implementation
|
|
will proceed in phases:
|
|
|
|
1. **Phase 1: Cassandra**
|
|
- Start with the home-grown Cassandra store
|
|
- Full control over the storage layer enables rapid iteration
|
|
- Schema will be redesigned from scratch for quads + reification
|
|
- Validate the data model and query patterns against real use cases
|
|
|
|
#### Cassandra Schema Design
|
|
|
|
Cassandra requires multiple tables to support different query access patterns
|
|
(each table efficiently queries by its partition key + clustering columns).
|
|
|
|
##### Query Patterns
|
|
|
|
With quads (g, s, p, o), each position can be specified or wildcard, giving
|
|
16 possible query patterns:
|
|
|
|
| # | g | s | p | o | Description |
|
|
|---|---|---|---|---|-------------|
|
|
| 1 | ? | ? | ? | ? | All quads |
|
|
| 2 | ? | ? | ? | o | By object |
|
|
| 3 | ? | ? | p | ? | By predicate |
|
|
| 4 | ? | ? | p | o | By predicate + object |
|
|
| 5 | ? | s | ? | ? | By subject |
|
|
| 6 | ? | s | ? | o | By subject + object |
|
|
| 7 | ? | s | p | ? | By subject + predicate |
|
|
| 8 | ? | s | p | o | Full triple (which graphs?) |
|
|
| 9 | g | ? | ? | ? | By graph |
|
|
| 10 | g | ? | ? | o | By graph + object |
|
|
| 11 | g | ? | p | ? | By graph + predicate |
|
|
| 12 | g | ? | p | o | By graph + predicate + object |
|
|
| 13 | g | s | ? | ? | By graph + subject |
|
|
| 14 | g | s | ? | o | By graph + subject + object |
|
|
| 15 | g | s | p | ? | By graph + subject + predicate |
|
|
| 16 | g | s | p | o | Exact quad |
|
|
|
|
##### Table Design
|
|
|
|
Cassandra constraint: You can only efficiently query by partition key, then
|
|
filter on clustering columns left-to-right. For g-wildcard queries, g must be
|
|
a clustering column. For g-specified queries, g in the partition key is more
|
|
efficient.
|
|
|
|
**Two table families needed:**
|
|
|
|
**Family A: g-wildcard queries** (g in clustering columns)
|
|
|
|
| Table | Partition | Clustering | Supports patterns |
|
|
|-------|-----------|------------|-------------------|
|
|
| SPOG | (user, collection, s) | p, o, g | 5, 7, 8 |
|
|
| POSG | (user, collection, p) | o, s, g | 3, 4 |
|
|
| OSPG | (user, collection, o) | s, p, g | 2, 6 |
|
|
|
|
**Family B: g-specified queries** (g in partition key)
|
|
|
|
| Table | Partition | Clustering | Supports patterns |
|
|
|-------|-----------|------------|-------------------|
|
|
| GSPO | (user, collection, g, s) | p, o | 9, 13, 15, 16 |
|
|
| GPOS | (user, collection, g, p) | o, s | 11, 12 |
|
|
| GOSP | (user, collection, g, o) | s, p | 10, 14 |
|
|
|
|
**Collection table** (for iteration and bulk deletion)
|
|
|
|
| Table | Partition | Clustering | Purpose |
|
|
|-------|-----------|------------|---------|
|
|
| COLL | (user, collection) | g, s, p, o | Enumerate all quads in collection |
|
|
|
|
##### Write and Delete Paths
|
|
|
|
**Write path**: Insert into all 7 tables.
|
|
|
|
**Delete collection path**:
|
|
1. Iterate COLL table for `(user, collection)`
|
|
2. For each quad, delete from all 6 query tables
|
|
3. Delete from COLL table (or range delete)
|
|
|
|
**Delete single quad path**: Delete from all 7 tables directly.
|
|
|
|
##### Storage Cost
|
|
|
|
Each quad is stored 7 times. This is the cost of flexible querying combined
|
|
with efficient collection deletion.
|
|
|
|
##### Quoted Triples in Storage
|
|
|
|
Subject or object can be a triple itself. Options:
|
|
|
|
**Option A: Serialize quoted triples to canonical string**
|
|
```
|
|
S: "<<http://ex/Alice|http://ex/knows|http://ex/Bob>>"
|
|
P: http://ex/discoveredOn
|
|
O: "2024-01-15"
|
|
G: null
|
|
```
|
|
- Store quoted triple as serialized string in S or O columns
|
|
- Query by exact match on serialized form
|
|
- Pro: Simple, fits existing index patterns
|
|
- Con: Can't query "find triples where quoted subject's predicate is X"
|
|
|
|
**Option B: Triple IDs / Hashes**
|
|
```
|
|
Triple table:
|
|
id: hash(s,p,o,g)
|
|
s, p, o, g: ...
|
|
|
|
Metadata table:
|
|
subject_triple_id: <hash>
|
|
p: http://ex/discoveredOn
|
|
o: "2024-01-15"
|
|
```
|
|
- Assign each triple an ID (hash of components)
|
|
- Reification metadata references triples by ID
|
|
- Pro: Clean separation, can index triple IDs
|
|
- Con: Requires computing/managing triple identity, two-phase lookups
|
|
|
|
**Recommendation**: Start with Option A (serialized strings) for simplicity.
|
|
Option B may be needed if advanced query patterns over quoted triple
|
|
components are required.
|
|
|
|
2. **Phase 2+: Other Backends**
|
|
- Neo4j and other stores implemented in subsequent stages
|
|
- Lessons learned from Cassandra inform these implementations
|
|
|
|
This approach de-risks the design by validating on a fully-controlled backend
|
|
before committing to implementations across all stores.
|
|
|
|
#### Value → Term Rename
|
|
|
|
The `Value` class will be renamed to `Term`. This affects ~78 files across
|
|
the codebase. The rename acts as a forcing function: any code still using
|
|
`Value` is immediately identifiable as needing review/update for 2.0
|
|
compatibility.
|
|
|
|
## Security Considerations
|
|
|
|
Named graphs are not a security feature. Users and collections remain the
|
|
security boundaries. Named graphs are purely for data organization and
|
|
reification support.
|
|
|
|
## Performance Considerations
|
|
|
|
- Quoted triples add nesting depth - may impact query performance
|
|
- Named graph indexing strategies needed for efficient graph-scoped queries
|
|
- Cassandra schema design will need to accommodate quad storage efficiently
|
|
|
|
### Vector Store Boundary
|
|
|
|
Vector stores always reference IRIs only:
|
|
- Never edges (quoted triples)
|
|
- Never literal values
|
|
- Never blank nodes
|
|
|
|
This keeps the vector store simple - it handles semantic similarity of named
|
|
entities. The graph structure handles relationships, reification, and metadata.
|
|
Quoted triples and named graphs don't complicate vector operations.
|
|
|
|
## Testing Strategy
|
|
|
|
Use existing test strategy. As this is a breaking change, extensive focus on
|
|
the end-to-end test suite to validate the new structures work correctly across
|
|
all components.
|
|
|
|
## Migration Plan
|
|
|
|
- 2.0 is a breaking release; no backward compatibility required
|
|
- Existing data may need migration to new schema (TBD based on final design)
|
|
- Consider migration tooling for converting existing triples
|
|
|
|
## Open Questions
|
|
|
|
- **Blank nodes**: Limited support confirmed. May need to decide on
|
|
skolemization strategy (generate IRIs on load, or preserve blank node IDs).
|
|
- **Query syntax**: What is the concrete syntax for specifying quoted triples
|
|
in queries? Need to define the query API.
|
|
- ~~**Predicate vocabulary**~~: Resolved. Any valid RDF predicates permitted,
|
|
including custom user-defined. Minimal assumptions about RDF validity.
|
|
Very few locked-in values (e.g., `rdfs:label` used in some places).
|
|
Strategy: avoid locking anything in unless absolutely necessary.
|
|
- ~~**Vector store impact**~~: Resolved. Vector stores always point to IRIs
|
|
only - never edges, literals, or blank nodes. Quoted triples and
|
|
reification don't affect the vector store.
|
|
- ~~**Named graph semantics**~~: Resolved. Queries default to the default
|
|
graph (matches SPARQL behavior, backward compatible). Explicit graph
|
|
parameter required to query named graphs or all graphs.
|
|
|
|
## References
|
|
|
|
- [RDF 1.2 Concepts](https://www.w3.org/TR/rdf12-concepts/)
|
|
- [RDF-star and SPARQL-star](https://w3c.github.io/rdf-star/)
|
|
- [RDF Dataset](https://www.w3.org/TR/rdf11-concepts/#section-dataset)
|