mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
1. Shared Provenance Module - URI generators, namespace constants,
triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
- Link entities to chunks via SUBJECT_OF (not top-level document)
- Removed duplicate metadata emission (now handled by librarian)
- Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
- EntityContext schema has chunk_id field
- EntityEmbeddings schema has chunk_id field
- Definitions extractor sets chunk_id when creating EntityContext
- Graph embeddings processor passes chunk_id through to
EntityEmbeddings
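The chunk_id threading in item 6 can be sketched with minimal dataclasses. This is a hedged illustration: only the schema names (EntityContext, EntityEmbeddings) and the chunk_id field come from the list above; all other fields and the to_embeddings helper are hypothetical.

```python
from dataclasses import dataclass, field

# Sketch of the embedding provenance chain described above.
# Only chunk_id and the two class names match the real schemas;
# the other fields are illustrative.

@dataclass
class EntityContext:
    entity: str = ""
    context: str = ""
    chunk_id: str = ""   # set by the definitions extractor

@dataclass
class EntityEmbeddings:
    entity: str = ""
    vectors: list = field(default_factory=list)
    chunk_id: str = ""   # passed through by the embeddings processor

def to_embeddings(ec: EntityContext, vectors: list) -> EntityEmbeddings:
    # The processor forwards chunk_id unchanged, so each embedding
    # stays traceable to the chunk it was derived from.
    return EntityEmbeddings(entity=ec.entity, vectors=vectors,
                            chunk_id=ec.chunk_id)

ec = EntityContext(entity="http://example/e1", context="...",
                   chunk_id="chunk-42")
emb = to_embeddings(ec, [0.1, 0.2])
print(emb.chunk_id)  # chunk-42
```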
Provenance Flow:

    Document  →  Page (PDF)  →  Chunk  →  Extracted Facts/Embeddings
        ↓             ↓            ↓                 ↓
    librarian     librarian    librarian    (chunk_id reference)
    + graph       + graph      + graph
Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.
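The traceability walk can be sketched with plain triples. The prov:wasDerivedFrom predicate is a real PROV-O property; the URIs and the direction of the edges here are assumptions for illustration, not the project's actual URI scheme.

```python
# Sketch of PROV-O derivation edges for the Document -> Page -> Chunk chain.
# URIs are illustrative; prov:wasDerivedFrom is a standard PROV-O property.
PROV = "http://www.w3.org/ns/prov#"

def derived_from(child: str, parent: str) -> tuple:
    # Child artifact was derived from its parent artifact.
    return (child, PROV + "wasDerivedFrom", parent)

doc = "http://example/doc/1"
page = "http://example/doc/1/page/3"
chunk = "http://example/doc/1/page/3/chunk/0"

triples = [derived_from(page, doc), derived_from(chunk, page)]

def ancestors(node: str, triples: list) -> list:
    # Follow wasDerivedFrom edges back to the source document.
    out, cur = [], node
    while True:
        nxt = [o for (s, p, o) in triples if s == cur]
        if not nxt:
            return out
        cur = nxt[0]
        out.append(cur)

print(ancestors(chunk, triples))  # [page URI, doc URI]
```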
Also updates tests.
42 lines
1.3 KiB
Python
from dataclasses import dataclass

from ..core.metadata import Metadata
from ..core.topic import topic

############################################################################

# PDF docs etc.

@dataclass
class Document:
    metadata: Metadata | None = None
    data: bytes = b""
    # For large document streaming: if document_id is set, the receiver should
    # fetch content from librarian instead of using inline data
    document_id: str = ""

############################################################################

# Text documents / text from PDF

@dataclass
class TextDocument:
    metadata: Metadata | None = None
    text: bytes = b""
    # For large document streaming: if document_id is set, the receiver should
    # fetch content from librarian instead of using inline text
    document_id: str = ""

############################################################################

# Chunks of text

@dataclass
class Chunk:
    metadata: Metadata | None = None
    chunk: bytes = b""
    # For provenance: document_id of this chunk in librarian
    # Post-chunker optimization: both document_id AND chunk content are included
    # so downstream processors have the ID for provenance and content to work with
    document_id: str = ""

############################################################################
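A minimal sketch of how a receiver might honor the streaming convention stated in the schema comments. The fetch_from_librarian helper is hypothetical, standing in for the real librarian lookup; the Document shape mirrors the schema above.

```python
from dataclasses import dataclass

@dataclass
class Document:
    metadata: object = None
    data: bytes = b""
    document_id: str = ""

def fetch_from_librarian(document_id: str) -> bytes:
    # Hypothetical stand-in for the real librarian content lookup.
    return b"content for " + document_id.encode()

def resolve_content(doc: Document) -> bytes:
    # Per the schema comment: if document_id is set, fetch the content
    # from the librarian; otherwise use the inline payload.
    if doc.document_id:
        return fetch_from_librarian(doc.document_id)
    return doc.data

inline = Document(data=b"inline bytes")
streamed = Document(document_id="doc-123")
print(resolve_content(inline))    # b'inline bytes'
print(resolve_content(streamed))  # b'content for doc-123'
```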