---
layout: default
title: "Graph Contexts Technical Specification"
parent: "Tech Specs"
---

# Graph Contexts Technical Specification

## Overview

This specification describes changes to TrustGraph's core graph primitives to align with RDF 1.2 and support full RDF Dataset semantics. This is a breaking change for the 2.x release series.

### Versioning

- **2.0**: Early adopter release. Core features available, may not be fully production-ready.
- **2.1 / 2.2**: Production release. Stability and completeness validated.

Flexibility on maturity is intentional - early adopters can access new capabilities before all features are production-hardened.

## Goals

The primary goals for this work are to enable metadata about facts/statements:

- **Temporal information**: Associate facts with time metadata
  - When a fact was believed to be true
  - When a fact became true
  - When a fact was discovered to be false
- **Provenance/Sources**: Track which sources support a fact
  - "This fact was supported by source X"
  - Link facts back to their origin documents
- **Veracity/Trust**: Record assertions about truth
  - "Person P asserted this was true"
  - "Person Q claims this is false"
  - Enable trust scoring and conflict detection

**Hypothesis**: Reification (RDF-star / quoted triples) is the key mechanism to achieve these outcomes, as all require making statements about statements.

## Background

To express "the fact (Alice knows Bob) was discovered on 2024-01-15" or "source X supports the claim (Y causes Z)", you need to reference an edge as a thing you can make statements about. Standard triples don't support this.

### Current Limitations

The current `Value` class in `trustgraph-base/trustgraph/schema/core/primitives.py` can represent:

- URI nodes (`is_uri=True`)
- Literal values (`is_uri=False`)

The `type` field exists but is not used to represent XSD datatypes.
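As a sketch of this limitation (the field names `is_uri` and `type` are taken from the description above; the exact class shape is assumed, not the real code), an integer and a date both collapse into untyped strings today:

```python
from dataclasses import dataclass

# Hypothetical sketch of the current Value primitive, based only on the
# fields mentioned above (value, is_uri, type). Not the actual class.
@dataclass
class Value:
    value: str = ""
    is_uri: bool = False
    type: str = ""   # present, but currently unused for XSD datatypes

# Both of these are stored as plain strings; consumers cannot tell
# "42" the integer from "42" the string, or recognise a date at all.
age = Value(value="42", is_uri=False)
discovered = Value(value="2024-01-15", is_uri=False)

assert age.type == ""   # no xsd:integer, no xsd:date - the datatype is lost
```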
## Technical Design

### RDF Features to Support

#### Core Features (Related to Reification Goals)

These features are directly related to the temporal, provenance, and veracity goals:

1. **RDF 1.2 Quoted Triples (RDF-star)**
   - Edges that point at other edges
   - A Triple can appear as the subject or object of another Triple
   - Enables statements about statements (reification)
   - Core mechanism for annotating individual facts

2. **RDF Dataset / Named Graphs**
   - Support for multiple named graphs within a dataset
   - Each graph identified by an IRI
   - Moves from triples (s, p, o) to quads (s, p, o, g)
   - Includes a default graph plus zero or more named graphs
   - The graph IRI can be a subject in statements, e.g.:
     ```
     <http://ex/graph1> <http://ex/createdOn>  "2024-01-15"
     <http://ex/graph1> <http://ex/confidence> "high"
     ```
   - Note: Named graphs are a separate feature from reification. They have uses beyond statement annotation (partitioning, access control, dataset organization) and should be treated as a distinct capability.

3. **Blank Nodes** (Limited Support)
   - Anonymous nodes without a global URI
   - Supported for compatibility when loading external RDF data
   - **Limited status**: No guarantees about stable identity after loading
   - Find them via wildcard queries (match by connections, not by ID)
   - Not a first-class feature - don't rely on precise blank node handling

#### Opportunistic Fixes (2.0 Breaking Change)

These features are not directly related to the reification goals but are valuable improvements to include while making breaking changes:

4. **Literal Datatypes**
   - Properly use the `type` field for XSD datatypes
   - Examples: xsd:string, xsd:integer, xsd:dateTime, etc.
   - Fixes current limitation: cannot represent dates or integers properly

5. **Language Tags**
   - Support for language attributes on string literals (@en, @fr, etc.)
   - Note: A literal has either a language tag OR a datatype, not both (except for rdf:langString)
   - Important for AI/multilingual use cases

### Data Models

#### Term (rename from Value)

The `Value` class will be renamed to `Term` to better reflect RDF terminology. This rename serves two purposes:

1. Aligns naming with RDF concepts (a "Term" can be an IRI, literal, blank node, or quoted triple - not just a "value")
2. Forces code review at the breaking change interface - any code still referencing `Value` is visibly broken and needs updating

A Term can represent:

- **IRI/URI** - A named node/resource
- **Blank Node** - An anonymous node with local scope
- **Literal** - A data value with either:
  - A datatype (XSD type), OR
  - A language tag
- **Quoted Triple** - A triple used as a term (RDF 1.2)

##### Chosen Approach: Single Class with Type Discriminator

Serialization requirements drive the structure - a type discriminator is needed in the wire format regardless of the Python representation. A single class with a type field is the natural fit and aligns with the current `Value` pattern.
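The wire-format argument can be illustrated with a minimal JSON encoding (hypothetical; the field names are illustrative, the single-character codes are those used for the Term type field): without the discriminator, an IRI and a literal whose value happens to look like an IRI serialize to indistinguishable payloads.

```python
import json

# Hypothetical wire encoding showing why a type discriminator is needed.
# "i" = IRI, "l" = literal (the spec's single-character type codes).
iri_term = {"type": "i", "iri": "http://example.org/Alice"}
literal_term = {"type": "l", "value": "http://example.org/Alice"}  # a literal that *looks* like an IRI

def decode(payload: str) -> dict:
    term = json.loads(payload)
    # The discriminator is the only reliable way to tell the cases apart.
    assert term["type"] in ("i", "b", "l", "t")
    return term

assert decode(json.dumps(iri_term))["type"] == "i"
assert decode(json.dumps(literal_term))["type"] == "l"
```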
Single-character type codes provide compact serialization:

```python
from dataclasses import dataclass

# Term type constants
IRI = "i"      # IRI/URI node
BLANK = "b"    # Blank node
LITERAL = "l"  # Literal value
TRIPLE = "t"   # Quoted triple (RDF-star)

@dataclass
class Term:
    type: str = ""        # One of: IRI, BLANK, LITERAL, TRIPLE

    # For IRI terms (type == IRI)
    iri: str = ""

    # For blank nodes (type == BLANK)
    id: str = ""

    # For literals (type == LITERAL)
    value: str = ""
    datatype: str = ""    # XSD datatype URI (mutually exclusive with language)
    language: str = ""    # Language tag (mutually exclusive with datatype)

    # For quoted triples (type == TRIPLE)
    triple: "Triple | None" = None
```

Usage examples:

```python
# IRI term
node = Term(type=IRI, iri="http://example.org/Alice")

# Literal with datatype
age = Term(type=LITERAL, value="42", datatype="xsd:integer")

# Literal with language tag
label = Term(type=LITERAL, value="Hello", language="en")

# Blank node
anon = Term(type=BLANK, id="_:b1")

# Quoted triple (statement about a statement)
inner = Triple(
    s=Term(type=IRI, iri="http://example.org/Alice"),
    p=Term(type=IRI, iri="http://example.org/knows"),
    o=Term(type=IRI, iri="http://example.org/Bob"),
)
reified = Term(type=TRIPLE, triple=inner)
```

##### Alternatives Considered

**Option B: Union of specialized classes** (`Term = IRI | BlankNode | Literal | QuotedTriple`)
- Rejected: Serialization would still need a type discriminator, adding complexity

**Option C: Base class with subclasses**
- Rejected: Same serialization issue, plus dataclass inheritance quirks

#### Triple / Quad

The `Triple` class gains an optional graph field to become a quad:

```python
@dataclass
class Triple:
    s: Term | None = None   # Subject
    p: Term | None = None   # Predicate
    o: Term | None = None   # Object
    g: str | None = None    # Graph name (IRI), None = default graph
```

Design decisions:

- **Field name**: `g` for consistency with `s`, `p`, `o`
- **Optional**: `None` means the default graph (unnamed)
- **Type**: Plain string (IRI) rather than Term
  - Graph names are always IRIs
  - Blank nodes as graph names ruled out (too confusing)
  - No need for the full Term machinery

Note: The class name stays `Triple` even though it's technically a quad now. This avoids churn, and "triple" is still the common terminology for the s/p/o portion. The graph context is metadata about where the triple lives.

### Candidate Query Patterns

The current query engine accepts combinations of S, P, O terms. With quoted triples, a triple itself becomes a valid term in those positions. Below are candidate query patterns that support the original goals. (The predicate IRIs in these examples are illustrative.)

#### Graph Parameter Semantics

Following SPARQL conventions for backward compatibility:

- **`g` omitted / None**: Query the default graph only
- **`g` = specific IRI**: Query that named graph only
- **`g` = wildcard / `*`**: Query across all graphs (equivalent to SPARQL `GRAPH ?g { ... }`)

This keeps simple queries simple and makes named graph queries opt-in.

Cross-graph queries (g=wildcard) are fully supported. The Cassandra schema includes dedicated tables (SPOG, POSG, OSPG) where g is a clustering column rather than a partition key, enabling efficient queries across all graphs.

#### Temporal Queries

**Find all facts discovered after a given date:**

```
S: ?                            # any quoted triple
P: <http://ex/discoveredOn>
O: > "2024-01-15"^^xsd:date     # date comparison
```

**Find when a specific fact was believed true:**

```
S: << <http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> >>   # quoted triple as subject
P: <http://ex/believedTrueOn>
O: ?                            # returns the date
```

**Find facts that became false:**

```
S: ?                            # any quoted triple
P: <http://ex/becameFalseOn>
O: ?                            # has any value (exists)
```

#### Provenance Queries

**Find all facts supported by a specific source:**

```
S: ?                            # any quoted triple
P: <http://ex/supportedBy>
O: <http://ex/sourceX>
```

**Find which sources support a specific fact:**

```
S: << <http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> >>   # quoted triple as subject
P: <http://ex/supportedBy>
O: ?                            # returns source IRIs
```

#### Veracity Queries

**Find assertions a person marked as true:**

```
S: ?                            # any quoted triple
P: <http://ex/assertedTrueBy>
O: <http://ex/personP>
```

**Find conflicting assertions (same fact, different veracity):**

```
# First query: facts asserted true
S: ?
P: <http://ex/assertedTrueBy>
O: ?

# Second query: facts asserted false
S: ?
P: <http://ex/assertedFalseBy>
O: ?

# Application logic: find intersection of subjects
```

**Find facts with trust score below threshold:**

```
S: ?                            # any quoted triple
P: <http://ex/trustScore>
O: < 0.5                        # numeric comparison
```

### Architecture

Significant changes are required across multiple components:

#### This Repository (trustgraph)

- **Schema primitives** (`trustgraph-base/trustgraph/schema/core/primitives.py`)
  - Value → Term rename
  - New Term structure with type discriminator
  - Triple gains `g` field for graph context
- **Message translators** (`trustgraph-base/trustgraph/messaging/translators/`)
  - Update for new Term/Triple structures
  - Serialization/deserialization for new fields
- **Gateway components** - Handle new Term and quad structures
- **Knowledge cores** - Core changes to support quads and reification
- **Knowledge manager** - Schema changes propagate here
- **Storage layers**
  - Cassandra: Schema redesign (see Implementation Details)
  - Other backends: Deferred to later phases
- **Command-line utilities** - Update for new data structures
- **REST API documentation** - OpenAPI spec updates

#### External Repositories

- **Python API** (this repo) - Client library updates for new structures
- **TypeScript APIs** (separate repo) - Client library updates
- **Workbench** (separate repo) - Significant state management changes

### APIs

#### REST API

- Documented in OpenAPI spec
- Will need updates for new Term/Triple structures
- New endpoints may be needed for graph context operations

#### Python API (this repo)

- Client library changes to match new primitives
- Breaking changes to Term (was Value) and Triple

#### TypeScript API (separate repo)

- Parallel changes to Python API
- Separate release coordination

#### Workbench (separate repo)

- Significant state management changes
- UI updates for graph context features

### Implementation Details

#### Phased Storage Implementation

Multiple graph store backends exist (Cassandra, Neo4j, etc.). Implementation will proceed in phases:

1. **Phase 1: Cassandra**
   - Start with the home-grown Cassandra store
   - Full control over the storage layer enables rapid iteration
   - Schema will be redesigned from scratch for quads + reification
   - Validate the data model and query patterns against real use cases

#### Cassandra Schema Design

Cassandra requires multiple tables to support different query access patterns (each table efficiently queries by its partition key + clustering columns).

##### Query Patterns

With quads (g, s, p, o), each position can be specified or wildcard, giving 16 possible query patterns:

| # | g | s | p | o | Description |
|---|---|---|---|---|-------------|
| 1 | ? | ? | ? | ? | All quads |
| 2 | ? | ? | ? | o | By object |
| 3 | ? | ? | p | ? | By predicate |
| 4 | ? | ? | p | o | By predicate + object |
| 5 | ? | s | ? | ? | By subject |
| 6 | ? | s | ? | o | By subject + object |
| 7 | ? | s | p | ? | By subject + predicate |
| 8 | ? | s | p | o | Full triple (which graphs?) |
| 9 | g | ? | ? | ? | By graph |
| 10 | g | ? | ? | o | By graph + object |
| 11 | g | ? | p | ? | By graph + predicate |
| 12 | g | ? | p | o | By graph + predicate + object |
| 13 | g | s | ? | ? | By graph + subject |
| 14 | g | s | ? | o | By graph + subject + object |
| 15 | g | s | p | ? | By graph + subject + predicate |
| 16 | g | s | p | o | Exact quad |

##### Table Design

Cassandra constraint: you can only efficiently query by partition key, then filter on clustering columns left-to-right. For g-wildcard queries, g must be a clustering column. For g-specified queries, g in the partition key is more efficient.
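The 16 patterns are simply the binary combinations of bound vs. wildcard positions, so a query planner can compute the pattern number directly. A minimal sketch (assuming `None` means wildcard, per the graph parameter semantics above):

```python
# Sketch: map a (g, s, p, o) query to its pattern number in the table
# above. None stands for a wildcard position; anything else is bound.
def query_pattern(g=None, s=None, p=None, o=None) -> int:
    n = 0
    for bound in (g is not None, s is not None, p is not None, o is not None):
        n = (n << 1) | int(bound)
    # Pattern numbers are 1 + the binary value of the (g, s, p, o) bound-bits.
    return n + 1

assert query_pattern() == 1                               # all quads
assert query_pattern(s="ex:Alice") == 5                   # by subject
assert query_pattern(s="ex:Alice", p="ex:knows") == 7     # subject + predicate
assert query_pattern(g="ex:g1", s="ex:Alice",
                     p="ex:knows", o="ex:Bob") == 16      # exact quad
```

Patterns 1-8 (g wildcard) route to tables where g is a clustering column; patterns 9-16 (g bound) route to tables where g sits in the partition key.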
**Two table families needed:**

**Family A: g-wildcard queries** (g in clustering columns)

| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| SPOG | (user, collection, s) | p, o, g | 5, 7, 8 |
| POSG | (user, collection, p) | o, s, g | 3, 4 |
| OSPG | (user, collection, o) | s, p, g | 2, 6 |

**Family B: g-specified queries** (g in partition key)

| Table | Partition | Clustering | Supports patterns |
|-------|-----------|------------|-------------------|
| GSPO | (user, collection, g, s) | p, o | 9, 13, 15, 16 |
| GPOS | (user, collection, g, p) | o, s | 11, 12 |
| GOSP | (user, collection, g, o) | s, p | 10, 14 |

**Collection table** (for iteration and bulk deletion)

| Table | Partition | Clustering | Purpose |
|-------|-----------|------------|---------|
| COLL | (user, collection) | g, s, p, o | Enumerate all quads in collection |

##### Write and Delete Paths

**Write path**: Insert into all 7 tables.

**Delete collection path**:

1. Iterate COLL table for `(user, collection)`
2. For each quad, delete from all 6 query tables
3. Delete from COLL table (or range delete)

**Delete single quad path**: Delete from all 7 tables directly.

##### Storage Cost

Each quad is stored 7 times. This is the cost of flexible querying combined with efficient collection deletion.

##### Quoted Triples in Storage

The subject or object can be a triple itself. Options:

**Option A: Serialize quoted triples to canonical string**

```
S: "<< <http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> >>"
P: http://ex/discoveredOn
O: "2024-01-15"
G: null
```

- Store quoted triple as serialized string in S or O columns
- Query by exact match on serialized form
- Pro: Simple, fits existing index patterns
- Con: Can't query "find triples where quoted subject's predicate is X"

**Option B: Triple IDs / Hashes**

```
Triple table:
  id: hash(s,p,o,g)
  s, p, o, g: ...

Metadata table:
  subject_triple_id: <hash>
  p: http://ex/discoveredOn
  o: "2024-01-15"
```

- Assign each triple an ID (hash of components)
- Reification metadata references triples by ID
- Pro: Clean separation, can index triple IDs
- Con: Requires computing/managing triple identity, two-phase lookups

**Recommendation**: Start with Option A (serialized strings) for simplicity. Option B may be needed if advanced query patterns over quoted triple components are required.

2. **Phase 2+: Other Backends**
   - Neo4j and other stores implemented in subsequent stages
   - Lessons learned from Cassandra inform these implementations

This approach de-risks the design by validating on a fully-controlled backend before committing to implementations across all stores.

#### Value → Term Rename

The `Value` class will be renamed to `Term`. This affects ~78 files across the codebase. The rename acts as a forcing function: any code still using `Value` is immediately identifiable as needing review/update for 2.0 compatibility.

## Security Considerations

Named graphs are not a security feature. Users and collections remain the security boundaries. Named graphs are purely for data organization and reification support.

## Performance Considerations

- Quoted triples add nesting depth - may impact query performance
- Named graph indexing strategies needed for efficient graph-scoped queries
- Cassandra schema design will need to accommodate quad storage efficiently

### Vector Store Boundary

Vector stores always reference IRIs only:

- Never edges (quoted triples)
- Never literal values
- Never blank nodes

This keeps the vector store simple - it handles semantic similarity of named entities. The graph structure handles relationships, reification, and metadata. Quoted triples and named graphs don't complicate vector operations.

## Testing Strategy

Use the existing test strategy.
As this is a breaking change, extensive focus will be placed on the end-to-end test suite to validate that the new structures work correctly across all components.

## Migration Plan

- 2.0 is a breaking release; no backward compatibility required
- Existing data may need migration to the new schema (TBD based on final design)
- Consider migration tooling for converting existing triples

## Open Questions

- **Blank nodes**: Limited support confirmed. May need to decide on a skolemization strategy (generate IRIs on load, or preserve blank node IDs).
- **Query syntax**: What is the concrete syntax for specifying quoted triples in queries? Need to define the query API.
- ~~**Predicate vocabulary**~~: Resolved. Any valid RDF predicates are permitted, including custom user-defined ones. Minimal assumptions about RDF validity. Very few locked-in values (e.g., `rdfs:label` used in some places). Strategy: avoid locking anything in unless absolutely necessary.
- ~~**Vector store impact**~~: Resolved. Vector stores always point to IRIs only - never edges, literals, or blank nodes. Quoted triples and reification don't affect the vector store.
- ~~**Named graph semantics**~~: Resolved. Queries default to the default graph (matches SPARQL behavior, backward compatible). An explicit graph parameter is required to query named graphs or all graphs.

## References

- [RDF 1.2 Concepts](https://www.w3.org/TR/rdf12-concepts/)
- [RDF-star and SPARQL-star](https://w3c.github.io/rdf-star/)
- [RDF Dataset](https://www.w3.org/TR/rdf11-concepts/#section-dataset)