mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
1. Shared Provenance Module - URI generators, namespace constants,
triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
- Link entities to chunks via SUBJECT_OF (not top-level document)
- No longer emit duplicate metadata (now handled by librarian)
- Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
- EntityContext schema has chunk_id field
- EntityEmbeddings schema has chunk_id field
- Definitions extractor sets chunk_id when creating EntityContext
- Graph embeddings processor passes chunk_id through to
EntityEmbeddings
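As an illustration of item 6, the following is a minimal sketch of how chunk_id might travel from an EntityContext to an EntityEmbeddings record. Only the chunk_id field is confirmed by this commit message. The other EntityEmbeddings fields, the plain-string entity, the embed_context helper, and the toy embedder are assumptions made for the example, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EntityContext:
    entity: str = ""      # simplified: the real schema uses a Term
    context: str = ""
    chunk_id: str = ""    # provenance: chunk this context was derived from


@dataclass
class EntityEmbeddings:
    # Hypothetical shape: only chunk_id is stated in the commit message
    entity: str = ""
    vectors: list[list[float]] = field(default_factory=list)
    chunk_id: str = ""


def embed_context(
    ctx: EntityContext,
    embed: Callable[[str], list[float]],
) -> EntityEmbeddings:
    """Embed the context text and carry chunk_id through unchanged."""
    return EntityEmbeddings(
        entity=ctx.entity,
        vectors=[embed(ctx.context)],
        chunk_id=ctx.chunk_id,
    )


def toy_embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (assumption for the sketch)
    return [float(len(text))]


ctx = EntityContext(
    entity="http://example.org/e/cat",
    context="A cat is a small mammal.",
    chunk_id="doc-1/p1/c3",
)
emb = embed_context(ctx, toy_embed)
```

The point of the sketch is that the embedding step never recomputes provenance; it only copies the identifier it was handed.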
Provenance Flow:
Document  →  Page (PDF)  →  Chunk      →  Extracted Facts/Embeddings
   ↓            ↓             ↓                  ↓
librarian    librarian     librarian      (chunk_id reference)
+ graph      + graph       + graph
Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.
Also updates the tests.
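The traceability described in this commit message can be sketched as a chain of derivation edges that is walked from a chunk back to its source document. This is a simplified sketch, not the shared provenance module: triples are plain tuples, the URIs are made up, and the use of prov:wasDerivedFrom is an assumption about which PROV-O property the edges use.

```python
# PROV-O property IRI; that the project uses this particular property
# for parent-child edges is an assumption of this sketch.
PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def derivation_triple(child_uri: str, parent_uri: str) -> tuple[str, str, str]:
    """Build a (subject, predicate, object) edge: child derived from parent."""
    return (child_uri, PROV_WAS_DERIVED_FROM, parent_uri)


def trace_to_source(uri: str, triples: list[tuple[str, str, str]]) -> list[str]:
    """Follow derivation edges from uri up to the root artifact."""
    parent_of = {s: o for s, p, o in triples if p == PROV_WAS_DERIVED_FROM}
    path = [uri]
    seen = {uri}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:   # guard against accidental cycles
            break
        path.append(parent)
        seen.add(parent)
    return path


# Hypothetical identifiers for a document, one of its pages, and a chunk
doc = "urn:example:doc/1"
page = "urn:example:doc/1/p2"
chunk = "urn:example:doc/1/p2/c5"

triples = [
    derivation_triple(page, doc),
    derivation_triple(chunk, page),
]

lineage = trace_to_source(chunk, triples)
```

Here lineage lists the chunk, then its page, then the document, which is the path an extracted fact would follow once it is linked to its chunk.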
33 lines
951 B
Python
from dataclasses import dataclass, field

from ..core.primitives import Term, Triple
from ..core.metadata import Metadata
from ..core.topic import topic

############################################################################

# An entity context is an entity associated with textual context

@dataclass
class EntityContext:
    entity: Term | None = None
    context: str = ""
    # Provenance: which chunk this entity context was derived from
    chunk_id: str = ""

# This is a 'batching' mechanism for the above data
@dataclass
class EntityContexts:
    metadata: Metadata | None = None
    entities: list[EntityContext] = field(default_factory=list)

############################################################################

# Graph triples

@dataclass
class Triples:
    metadata: Metadata | None = None
    triples: list[Triple] = field(default_factory=list)

############################################################################
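A short usage sketch of the batching dataclasses above may help. Because the real Term and Metadata classes live in other modules and are not shown here, the example defines hypothetical stand-ins with made-up fields (value, id). It also uses Optional[...] rather than the `X | None` syntax so that it runs on older Python versions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    value: str = ""   # hypothetical stand-in for core.primitives.Term


@dataclass
class Metadata:
    id: str = ""      # hypothetical stand-in for core.metadata.Metadata


@dataclass
class EntityContext:
    entity: Optional[Term] = None
    context: str = ""
    chunk_id: str = ""


@dataclass
class EntityContexts:
    metadata: Optional[Metadata] = None
    # default_factory gives every batch its own list
    entities: list[EntityContext] = field(default_factory=list)


batch_a = EntityContexts(metadata=Metadata(id="doc-1"))
batch_b = EntityContexts(metadata=Metadata(id="doc-1"))

batch_a.entities.append(
    EntityContext(
        entity=Term("http://example.org/e/cat"),
        context="A cat is a small mammal.",
        chunk_id="doc-1/p1/c3",
    )
)
```

Using field(default_factory=list) instead of a shared default means appending to batch_a leaves batch_b empty, and each entity in a batch keeps its own chunk_id.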