diff --git a/docs/tech-specs/graph-contexts.md b/docs/tech-specs/graph-contexts.md
new file mode 100644
index 00000000..8ddb3952
--- /dev/null
+++ b/docs/tech-specs/graph-contexts.md
@@ -0,0 +1,522 @@
+# Graph Contexts Technical Specification
+
+## Overview
+
+This specification describes changes to TrustGraph's core graph primitives to
+align with RDF 1.2 and support full RDF Dataset semantics. This is a breaking
+change for the 2.x release series.
+
+### Versioning
+
+- **2.0**: Early adopter release. Core features available, may not be fully
+  production-ready.
+- **2.1 / 2.2**: Production release. Stability and completeness validated.
+
+Flexibility on maturity is intentional - early adopters can access new
+capabilities before all features are production-hardened.
+
+## Goals
+
+The primary goals for this work are to enable metadata about facts/statements:
+
+- **Temporal information**: Associate facts with time metadata
+  - When a fact was believed to be true
+  - When a fact became true
+  - When a fact was discovered to be false
+
+- **Provenance/Sources**: Track which sources support a fact
+  - "This fact was supported by source X"
+  - Link facts back to their origin documents
+
+- **Veracity/Trust**: Record assertions about truth
+  - "Person P asserted this was true"
+  - "Person Q claims this is false"
+  - Enable trust scoring and conflict detection
+
+**Hypothesis**: Reification (RDF-star / quoted triples) is the key mechanism
+to achieve these outcomes, as all require making statements about statements.
+
+## Background
+
+To express "the fact (Alice knows Bob) was discovered on 2024-01-15" or
+"source X supports the claim (Y causes Z)", you need to reference an edge
+as a thing you can make statements about. Standard triples don't support this.
+
+### Current Limitations
+
+The current `Value` class in `trustgraph-base/trustgraph/schema/core/primitives.py`
+can represent:
+- URI nodes (`is_uri=True`)
+- Literal values (`is_uri=False`)
+
+The `type` field exists but is not used to represent XSD datatypes.
+
+## Technical Design
+
+### RDF Features to Support
+
+#### Core Features (Related to Reification Goals)
+
+These features are directly related to the temporal, provenance, and veracity
+goals:
+
+1. **RDF 1.2 Quoted Triples (RDF-star)**
+   - Edges that point at other edges
+   - A Triple can appear as the subject or object of another Triple
+   - Enables statements about statements (reification)
+   - Core mechanism for annotating individual facts
+
+2. **RDF Dataset / Named Graphs**
+   - Support for multiple named graphs within a dataset
+   - Each graph identified by an IRI
+   - Moves from triples (s, p, o) to quads (s, p, o, g)
+   - Includes a default graph plus zero or more named graphs
+   - The graph IRI can be a subject in statements, e.g.:
+     ```
+     "2024-01-15"
+     "high"
+     ```
+   - Note: Named graphs are a separate feature from reification. They have
+     uses beyond statement annotation (partitioning, access control, dataset
+     organization) and should be treated as a distinct capability.
+
+3. **Blank Nodes** (Limited Support)
+   - Anonymous nodes without a global URI
+   - Supported for compatibility when loading external RDF data
+   - **Limited status**: No guarantees about stable identity after loading
+   - Find them via wildcard queries (match by connections, not by ID)
+   - Not a first-class feature - don't rely on precise blank node handling
+
+#### Opportunistic Fixes (2.0 Breaking Change)
+
+These features are not directly related to the reification goals but are
+valuable improvements to include while making breaking changes:
+
+4. **Literal Datatypes**
+   - Properly use the `type` field for XSD datatypes
+   - Examples: xsd:string, xsd:integer, xsd:dateTime, etc.
+   - Fixes current limitation: cannot represent dates or integers properly
+
+5. **Language Tags**
+   - Support for language attributes on string literals (@en, @fr, etc.)
+   - Note: A literal has either a language tag OR a datatype, not both
+     (except for rdf:langString)
+   - Important for AI/multilingual use cases
+
+### Data Models
+
+#### Term (rename from Value)
+
+The `Value` class will be renamed to `Term` to better reflect RDF terminology.
+This rename serves two purposes:
+1. Aligns naming with RDF concepts (a "Term" can be an IRI, literal, blank
+   node, or quoted triple - not just a "value")
+2. Forces code review at the breaking change interface - any code still
+   referencing `Value` is visibly broken and needs updating
+
+A Term can represent:
+
+- **IRI/URI** - A named node/resource
+- **Blank Node** - An anonymous node with local scope
+- **Literal** - A data value with either:
+  - A datatype (XSD type), OR
+  - A language tag
+- **Quoted Triple** - A triple used as a term (RDF 1.2)
+
+##### Chosen Approach: Single Class with Type Discriminator
+
+Serialization requirements drive the structure - a type discriminator is needed
+in the wire format regardless of the Python representation. A single class with
+a type field is the natural fit and aligns with the current `Value` pattern.
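To illustrate why the discriminator has to travel on the wire, here is a minimal sketch. The dict layout (keys `t`, `v`, `s`, `p`, `o`) and the `render` helper are hypothetical and are not TrustGraph's actual message format; only the four kinds of term and their one-character codes come from this specification.

```python
import json

# One-character tags, one per kind of term (the key names below are assumptions).
IRI, BLANK, LITERAL, TRIPLE = "i", "b", "l", "t"


def render(wire: dict) -> str:
    """Render a decoded wire term, dispatching on its type tag."""
    kind = wire["t"]
    if kind == IRI:
        return f"<{wire['v']}>"
    if kind == BLANK:
        return f"_:{wire['v']}"
    if kind == LITERAL:
        return json.dumps(wire["v"])
    if kind == TRIPLE:
        # A quoted triple nests three further tagged terms.
        inner = " ".join(render(wire[k]) for k in ("s", "p", "o"))
        return f"<< {inner} >>"
    raise ValueError(f"unknown term type: {kind!r}")


# The same text is an IRI under one tag and a plain string under another;
# without the tag the receiver could not tell them apart.
same_text = "http://example.org/Alice"
payload = json.dumps([
    {"t": IRI, "v": same_text},
    {"t": LITERAL, "v": same_text},
])
rendered = [render(term) for term in json.loads(payload)]
print(rendered)
```

Whatever Python shape is chosen (single class, union, or subclasses), the receiving side needs a tag like this to pick the right interpretation, which is the argument for keeping the tag as a field on the class itself.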
+
+Single-character type codes provide compact serialization:
+
+```python
+from dataclasses import dataclass
+
+# Term type constants
+IRI = "i"      # IRI/URI node
+BLANK = "b"    # Blank node
+LITERAL = "l"  # Literal value
+TRIPLE = "t"   # Quoted triple (RDF-star)
+
+@dataclass
+class Term:
+    type: str = ""  # One of: IRI, BLANK, LITERAL, TRIPLE
+
+    # For IRI terms (type == IRI)
+    iri: str = ""
+
+    # For blank nodes (type == BLANK)
+    id: str = ""
+
+    # For literals (type == LITERAL)
+    value: str = ""
+    datatype: str = ""  # XSD datatype URI (mutually exclusive with language)
+    language: str = ""  # Language tag (mutually exclusive with datatype)
+
+    # For quoted triples (type == TRIPLE)
+    triple: "Triple | None" = None
+```
+
+Usage examples:
+
+```python
+# IRI term
+node = Term(type=IRI, iri="http://example.org/Alice")
+
+# Literal with datatype
+age = Term(type=LITERAL, value="42", datatype="xsd:integer")
+
+# Literal with language tag
+label = Term(type=LITERAL, value="Hello", language="en")
+
+# Blank node
+anon = Term(type=BLANK, id="_:b1")
+
+# Quoted triple (statement about a statement)
+inner = Triple(
+    s=Term(type=IRI, iri="http://example.org/Alice"),
+    p=Term(type=IRI, iri="http://example.org/knows"),
+    o=Term(type=IRI, iri="http://example.org/Bob"),
+)
+reified = Term(type=TRIPLE, triple=inner)
+```
+
+##### Alternatives Considered
+
+**Option B: Union of specialized classes** (`Term = IRI | BlankNode | Literal | QuotedTriple`)
+- Rejected: Serialization would still need a type discriminator, adding complexity
+
+**Option C: Base class with subclasses**
+- Rejected: Same serialization issue, plus dataclass inheritance quirks
+
+#### Triple / Quad
+
+The `Triple` class gains an optional graph field to become a quad:
+
+```python
+@dataclass
+class Triple:
+    s: Term | None = None  # Subject
+    p: Term | None = None  # Predicate
+    o: Term | None = None  # Object
+    g: str | None = None   # Graph name (IRI), None = default graph
+```
+
+Design decisions:
+- **Field name**: `g` for consistency with `s`, `p`, `o`
+- **Optional**: `None` means the default graph (unnamed)
+- **Type**: Plain string (IRI) rather than Term
+  - Graph names are always IRIs
+  - Blank nodes as graph names ruled out (too confusing)
+  - No need for the full Term machinery
+
+Note: The class name stays `Triple` even though it's technically a quad now.
+This avoids churn and "triple" is still the common terminology for the s/p/o
+portion. The graph context is metadata about where the triple lives.
+
+### Candidate Query Patterns
+
+The current query engine accepts combinations of S, P, O terms. With quoted
+triples, a triple itself becomes a valid term in those positions. Below are
+candidate query patterns that support the original goals.
+
+#### Graph Parameter Semantics
+
+Following SPARQL conventions for backward compatibility:
+
+- **`g` omitted / None**: Query the default graph only
+- **`g` = specific IRI**: Query that named graph only
+- **`g` = wildcard / `*`**: Query across all graphs
+
+This keeps simple queries simple and makes named graph queries opt-in.
+
+#### Temporal Queries
+
+**Find all facts discovered after a given date:**
+```
+S: ?                          # any quoted triple
+P:
+O: > "2024-01-15"^^xsd:date   # date comparison
+```
+
+**Find when a specific fact was believed true:**
+```
+S: << >>                      # quoted triple as subject
+P:
+O: ?                          # returns the date
+```
+
+**Find facts that became false:**
+```
+S: ?                          # any quoted triple
+P:
+O: ?                          # has any value (exists)
+```
+
+#### Provenance Queries
+
+**Find all facts supported by a specific source:**
+```
+S: ?                          # any quoted triple
+P:
+O:
+```
+
+**Find which sources support a specific fact:**
+```
+S: << >>                      # quoted triple as subject
+P:
+O: ?                          # returns source IRIs
+```
+
+#### Veracity Queries
+
+**Find assertions a person marked as true:**
+```
+S: ?                          # any quoted triple
+P:
+O:
+```
+
+**Find conflicting assertions (same fact, different veracity):**
+```
+# First query: facts asserted true
+S: ?
+P:
+O: ?
+
+# Second query: facts asserted false
+S: ?
+P:
+O: ?
+
+# Application logic: find intersection of subjects
+```
+
+**Find facts with trust score below threshold:**
+```
+S: ?                          # any quoted triple
+P:
+O: < 0.5                      # numeric comparison
+```
+
+### Architecture
+
+Significant changes required across multiple components:
+
+#### This Repository (trustgraph)
+
+- **Schema primitives** (`trustgraph-base/trustgraph/schema/core/primitives.py`)
+  - Value → Term rename
+  - New Term structure with type discriminator
+  - Triple gains `g` field for graph context
+
+- **Message translators** (`trustgraph-base/trustgraph/messaging/translators/`)
+  - Update for new Term/Triple structures
+  - Serialization/deserialization for new fields
+
+- **Gateway components**
+  - Handle new Term and quad structures
+
+- **Knowledge cores**
+  - Core changes to support quads and reification
+
+- **Knowledge manager**
+  - Schema changes propagate here
+
+- **Storage layers**
+  - Cassandra: Schema redesign (see Implementation Details)
+  - Other backends: Deferred to later phases
+
+- **Command-line utilities**
+  - Update for new data structures
+
+- **REST API documentation**
+  - OpenAPI spec updates
+
+#### External Repositories
+
+- **Python API** (this repo)
+  - Client library updates for new structures
+
+- **TypeScript APIs** (separate repo)
+  - Client library updates
+
+- **Workbench** (separate repo)
+  - Significant state management changes
+
+### APIs
+
+#### REST API
+
+- Documented in OpenAPI spec
+- Will need updates for new Term/Triple structures
+- New endpoints may be needed for graph context operations
+
+#### Python API (this repo)
+
+- Client library changes to match new primitives
+- Breaking changes to Term (was Value) and Triple
+
+#### TypeScript API (separate repo)
+
+- Parallel changes to Python API
+- Separate release coordination
+
+#### Workbench (separate repo)
+
+- Significant state management changes
+- UI updates for graph context features
+
+### Implementation Details
+
+#### Phased Storage Implementation
+
+Multiple graph store backends exist (Cassandra, Neo4j, etc.). Implementation
+will proceed in phases:
+
+1. **Phase 1: Cassandra**
+   - Start with the home-grown Cassandra store
+   - Full control over the storage layer enables rapid iteration
+   - Schema will be redesigned from scratch for quads + reification
+   - Validate the data model and query patterns against real use cases
+
+#### Cassandra Schema Design
+
+Cassandra requires multiple tables to support different query access patterns
+(each table efficiently queries by its partition key + clustering columns).
+
+**Challenge: Quads**
+
+For triples, typical indexes are SPO, POS, OSP (partition by first, cluster by
+rest). For quads, the graph dimension adds: SPOG, POSG, OSPG, GSPO, etc.
+
+**Challenge: Quoted Triples**
+
+Subject or object can be a triple itself. Options:
+
+**Option A: Serialize quoted triples to canonical string**
+```
+S: "<>"
+P: http://ex/discoveredOn
+O: "2024-01-15"
+G: null
+```
+- Store quoted triple as serialized string in S or O columns
+- Query by exact match on serialized form
+- Pro: Simple, fits existing index patterns
+- Con: Can't query "find triples where quoted subject's predicate is X"
+
+**Option B: Triple IDs / Hashes**
+```
+Triple table:
+  id: hash(s,p,o,g)
+  s, p, o, g: ...
+
+Metadata table:
+  subject_triple_id:
+  p: http://ex/discoveredOn
+  o: "2024-01-15"
+```
+- Assign each triple an ID (hash of components)
+- Reification metadata references triples by ID
+- Pro: Clean separation, can index triple IDs
+- Con: Requires computing/managing triple identity, two-phase lookups
+
+**Option C: Hybrid**
+- Store quads normally with serialized quoted triple strings for simple cases
+- Maintain a separate triple ID lookup for advanced queries
+- Pro: Flexibility
+- Con: Complexity
+
+**Recommendation**: TBD after prototyping. Option A is simplest for initial
+implementation; Option B may be needed for advanced query patterns.
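Options A and B both depend on a deterministic encoding of a triple, so the two can be prototyped together. The sketch below is one possible shape, not the chosen design: `canonical` and `triple_id` are hypothetical helpers, the N-Triples-like output syntax and the use of SHA-256 are assumptions, and the `Term` / `Triple` dataclasses are copied from the Data Models section (with `Optional` in place of `| None` so the block runs on older Python versions).

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

IRI, BLANK, LITERAL, TRIPLE = "i", "b", "l", "t"


@dataclass
class Term:
    type: str = ""
    iri: str = ""
    id: str = ""
    value: str = ""
    datatype: str = ""
    language: str = ""
    triple: "Optional[Triple]" = None


@dataclass
class Triple:
    s: Optional[Term] = None
    p: Optional[Term] = None
    o: Optional[Term] = None
    g: Optional[str] = None


def canonical(term: Term) -> str:
    """Option A: one deterministic string per term, recursing into quoted triples."""
    if term.type == IRI:
        return f"<{term.iri}>"
    if term.type == BLANK:
        return term.id
    if term.type == LITERAL:
        escaped = (term.value.replace("\\", "\\\\")
                   .replace('"', '\\"').replace("\n", "\\n"))
        if term.language:
            return f'"{escaped}"@{term.language}'
        if term.datatype:
            return f'"{escaped}"^^<{term.datatype}>'
        return f'"{escaped}"'
    if term.type == TRIPLE and term.triple is not None:
        t = term.triple
        return f"<< {canonical(t.s)} {canonical(t.p)} {canonical(t.o)} >>"
    raise ValueError(f"cannot serialize term of type {term.type!r}")


def triple_id(triple: Triple) -> str:
    """Option B: hash of (s, p, o, g); the default graph hashes as an empty name."""
    parts = [canonical(triple.s), canonical(triple.p),
             canonical(triple.o), triple.g or ""]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


inner = Triple(
    s=Term(type=IRI, iri="http://example.org/Alice"),
    p=Term(type=IRI, iri="http://example.org/knows"),
    o=Term(type=IRI, iri="http://example.org/Bob"),
)
subject_key = canonical(Term(type=TRIPLE, triple=inner))  # value for an S or O column
print(subject_key)
print(triple_id(inner))
```

Under Option A the string in `subject_key` would be stored directly in the S or O column and matched exactly. Under Option B the digest would be the row key of the triple table. Including `g` in the hash, as the Option B layout does, means the same s/p/o in two graphs gets two IDs; whether that is desirable is one of the questions prototyping would need to answer.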
+
+#### Indexing Strategy
+
+Indexes must support the defined query patterns:
+
+| Query Type | Access Pattern | Index Needed |
+|------------|----------------|--------------|
+| Facts by date | P=discoveredOn, O>date | POG (predicate, object, graph) |
+| Facts by source | P=supportedBy, O=source | POG |
+| Facts by asserter | P=assertedBy, O=person | POG |
+| Metadata for a fact | S=quotedTriple | SPO/SPOG |
+| All facts in graph | G=graphIRI | GSPO |
+
+For temporal range queries (dates), Cassandra clustering column ordering
+enables efficient scans when date is a clustering column.
+
+2. **Phase 2+: Other Backends**
+   - Neo4j and other stores implemented in subsequent stages
+   - Lessons learned from Cassandra inform these implementations
+
+This approach de-risks the design by validating on a fully-controlled backend
+before committing to implementations across all stores.
+
+#### Value → Term Rename
+
+The `Value` class will be renamed to `Term`. This affects ~78 files across
+the codebase. The rename acts as a forcing function: any code still using
+`Value` is immediately identifiable as needing review/update for 2.0
+compatibility.
+
+## Security Considerations
+
+Named graphs are not a security feature. Users and collections remain the
+security boundaries. Named graphs are purely for data organization and
+reification support.
+
+## Performance Considerations
+
+- Quoted triples add nesting depth - may impact query performance
+- Named graph indexing strategies needed for efficient graph-scoped queries
+- Cassandra schema design will need to accommodate quad storage efficiently
+
+### Vector Store Boundary
+
+Vector stores always reference IRIs only:
+- Never edges (quoted triples)
+- Never literal values
+- Never blank nodes
+
+This keeps the vector store simple - it handles semantic similarity of named
+entities. The graph structure handles relationships, reification, and metadata.
+Quoted triples and named graphs don't complicate vector operations.
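The boundary rule amounts to a filter on the term type. As a minimal sketch, assuming a trimmed-down copy of the `Term` dataclass from the Data Models section, a hypothetical helper `embeddable_iris` (not part of TrustGraph) could select what is allowed to reach the vector store:

```python
from dataclasses import dataclass
from typing import Iterable

IRI, BLANK, LITERAL, TRIPLE = "i", "b", "l", "t"


@dataclass
class Term:
    # Only the fields this sketch needs; the full class has more.
    type: str = ""
    iri: str = ""
    id: str = ""
    value: str = ""


def embeddable_iris(terms: Iterable[Term]) -> list[str]:
    """Keep only IRI terms, de-duplicated in first-seen order."""
    seen: dict[str, None] = {}
    for term in terms:
        if term.type == IRI:
            seen.setdefault(term.iri, None)
    return list(seen)


candidates = [
    Term(type=IRI, iri="http://example.org/Alice"),
    Term(type=LITERAL, value="Alice"),
    Term(type=BLANK, id="_:b1"),
    Term(type=TRIPLE),  # a quoted triple is never embedded
    Term(type=IRI, iri="http://example.org/Alice"),
]
print(embeddable_iris(candidates))
```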
+
+## Testing Strategy
+
+Use existing test strategy. As this is a breaking change, extensive focus on
+the end-to-end test suite to validate the new structures work correctly across
+all components.
+
+## Migration Plan
+
+- 2.0 is a breaking release; no backward compatibility required
+- Existing data may need migration to new schema (TBD based on final design)
+- Consider migration tooling for converting existing triples
+
+## Open Questions
+
+- **Blank nodes**: Limited support confirmed. May need to decide on
+  skolemization strategy (generate IRIs on load, or preserve blank node IDs).
+- **Query syntax**: What is the concrete syntax for specifying quoted triples
+  in queries? Need to define the query API.
+- ~~**Predicate vocabulary**~~: Resolved. Any valid RDF predicates permitted,
+  including custom user-defined. Minimal assumptions about RDF validity.
+  Very few locked-in values (e.g., `rdfs:label` used in some places).
+  Strategy: avoid locking anything in unless absolutely necessary.
+- ~~**Vector store impact**~~: Resolved. Vector stores always point to IRIs
+  only - never edges, literals, or blank nodes. Quoted triples and
+  reification don't affect the vector store.
+- ~~**Named graph semantics**~~: Resolved. Queries default to the default
+  graph (matches SPARQL behavior, backward compatible). Explicit graph
+  parameter required to query named graphs or all graphs.
+
+## References
+
+- [RDF 1.2 Concepts](https://www.w3.org/TR/rdf12-concepts/)
+- [RDF-star and SPARQL-star](https://w3c.github.io/rdf-star/)
+- [RDF Dataset](https://www.w3.org/TR/rdf11-concepts/#section-dataset)