Mirror of https://github.com/trustgraph-ai/trustgraph.git
Synced 2026-05-18 03:45:12 +02:00
Extract-time provenance (#661)

1. Shared Provenance Module - URI generators, namespace constants,
   triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
   initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
   provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
   forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
   - Link entities to chunks via SUBJECT_OF (not top-level document)
   - Removed duplicate metadata emission (now handled by librarian)
   - Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
   - EntityContext schema has chunk_id field
   - EntityEmbeddings schema has chunk_id field
   - Definitions extractor sets chunk_id when creating EntityContext
   - Graph embeddings processor passes chunk_id through to
     EntityEmbeddings

Provenance Flow:

Document → Page (PDF) → Chunk → Extracted Facts/Embeddings
    ↓          ↓          ↓                  ↓
librarian  librarian  librarian    (chunk_id reference)
 + graph    + graph    + graph

Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.

Also updates tests.
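
The provenance flow in the commit message can be illustrated with a minimal sketch. This is not the code from the commit: the URI layout (`urn:example:...`) and the helper names `derivation_triples` and `trace_to_source` are assumptions made for illustration. Only the `prov:wasDerivedFrom` predicate comes from the PROV-O vocabulary.

```python
# Illustrative sketch only; not the shared provenance module from this commit.
PROV = "http://www.w3.org/ns/prov#"
WAS_DERIVED_FROM = PROV + "wasDerivedFrom"


def derivation_triples(chain):
    """Build (subject, predicate, object) triples linking each artifact
    to its parent. `chain` is ordered from source document to the most
    derived artifact."""
    return [
        (child, WAS_DERIVED_FROM, parent)
        for parent, child in zip(chain, chain[1:])
    ]


def trace_to_source(uri, triples):
    """Follow wasDerivedFrom edges upward until no parent remains."""
    parents = {s: o for s, p, o in triples if p == WAS_DERIVED_FROM}
    while uri in parents:
        uri = parents[uri]
    return uri


# Hypothetical URIs for a document, one of its pages, and one chunk.
chain = [
    "urn:example:doc/1",
    "urn:example:doc/1/p3",
    "urn:example:doc/1/p3/c0",
]
triples = derivation_triples(chain)
source = trace_to_source("urn:example:doc/1/p3/c0", triples)
```

Starting from the chunk URI, `trace_to_source` walks two edges and stops at the document URI, which is the kind of traceability the commit message describes.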
parent d8f0a576af
commit cd5580be59
20 changed files with 1601 additions and 59 deletions
@@ -34,5 +34,9 @@ class TextDocument:
 class Chunk:
     metadata: Metadata | None = None
     chunk: bytes = b""
+    # For provenance: document_id of this chunk in librarian
+    # Post-chunker optimization: both document_id AND chunk content are included
+    # so downstream processors have the ID for provenance and content to work with
+    document_id: str = ""
 
 ############################################################################
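
A rough sketch of how a downstream processor might use the new `document_id` field on `Chunk`. The `Chunk` fields mirror the hunk above. The `Metadata` stand-in, its `id` field, and the `describe_chunk` helper are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    # Hypothetical stand-in for the real Metadata schema.
    id: str = ""


@dataclass
class Chunk:
    metadata: Optional[Metadata] = None
    chunk: bytes = b""
    # Librarian ID of this chunk, used for provenance
    document_id: str = ""


def describe_chunk(msg: Chunk) -> tuple:
    """Return (provenance ID, decoded text) from one chunk message.

    Both values arrive in the same message, so no extra librarian
    lookup is needed to get the content.
    """
    return msg.document_id, msg.chunk.decode("utf-8")


msg = Chunk(
    metadata=Metadata(id="doc-1"),
    chunk="Some chunk text".encode("utf-8"),
    document_id="doc-1/p3/c0",  # hypothetical ID format
)
chunk_id, text = describe_chunk(msg)
```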
@@ -12,6 +12,8 @@ from ..core.topic import topic
 class EntityEmbeddings:
     entity: Term | None = None
     vectors: list[list[float]] = field(default_factory=list)
+    # Provenance: which chunk this embedding was derived from
+    chunk_id: str = ""
 
 # This is a 'batching' mechanism for the above data
 @dataclass
@@ -12,6 +12,8 @@ from ..core.topic import topic
 class EntityContext:
     entity: Term | None = None
     context: str = ""
+    # Provenance: which chunk this entity context was derived from
+    chunk_id: str = ""
 
 # This is a 'batching' mechanism for the above data
 @dataclass
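
The `chunk_id` pass-through from `EntityContext` to `EntityEmbeddings` might look roughly like the sketch below. It simplifies `entity` to a plain string in place of `Term`, and both `fake_embed` and `to_embeddings` are hypothetical names, not functions from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class EntityContext:
    entity: str = ""  # simplified; the real schema uses Term | None
    context: str = ""
    chunk_id: str = ""


@dataclass
class EntityEmbeddings:
    entity: str = ""  # simplified; the real schema uses Term | None
    vectors: list = field(default_factory=list)
    chunk_id: str = ""


def fake_embed(text):
    """Hypothetical embedding function: one 1-dimensional vector."""
    return [[float(len(text))]]


def to_embeddings(ctx, embed=fake_embed):
    """Embed the context text and carry chunk_id through unchanged."""
    return EntityEmbeddings(
        entity=ctx.entity,
        vectors=embed(ctx.context),
        chunk_id=ctx.chunk_id,
    )


ctx = EntityContext(
    entity="urn:example:entity/cat",  # hypothetical entity URI
    context="A cat is a small domesticated mammal.",
    chunk_id="doc-1/p3/c0",  # hypothetical chunk ID
)
emb = to_embeddings(ctx)
```

Because `chunk_id` is copied rather than recomputed, an embedding can be traced to its source chunk without consulting the graph.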
@@ -91,7 +91,12 @@ class DocumentMetadata:
     tags: list[str] = field(default_factory=list)
     # Child document support
     parent_id: str = "" # Empty for top-level docs, set for children
-    document_type: str = "source" # "source" or "extracted"
+    # Document type vocabulary:
+    # "source" - original uploaded document
+    # "page" - page extracted from source (e.g., PDF page)
+    # "chunk" - text chunk derived from page or source
+    # "extracted" - legacy value, kept for backwards compatibility
+    document_type: str = "source"
 
 @dataclass
 class ProcessingMetadata:
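
As an illustration of the parent-child linking, the sketch below walks `parent_id` from a chunk back to its source document. The trimmed `DocumentMetadata` keeps only `parent_id` and `document_type` from the hunk above. The `id` field, the in-memory `store` dictionary, and the `ancestry` helper are assumptions for this example and do not reflect the librarian's actual API.

```python
from dataclasses import dataclass


@dataclass
class DocumentMetadata:
    id: str = ""  # assumed field, not shown in the hunk
    parent_id: str = ""  # Empty for top-level docs, set for children
    document_type: str = "source"


def ancestry(doc_id, store):
    """Return the documents from doc_id up to the top-level document.

    `store` maps document ID to DocumentMetadata. The `seen` set guards
    against accidental cycles in parent_id links.
    """
    path = []
    seen = set()
    while doc_id and doc_id not in seen:
        seen.add(doc_id)
        doc = store[doc_id]
        path.append(doc)
        doc_id = doc.parent_id
    return path


# Hypothetical IDs for a source document, one page, and one chunk.
store = {
    "doc-1": DocumentMetadata(id="doc-1"),
    "doc-1/p3": DocumentMetadata(
        id="doc-1/p3", parent_id="doc-1", document_type="page"
    ),
    "doc-1/p3/c0": DocumentMetadata(
        id="doc-1/p3/c0", parent_id="doc-1/p3", document_type="chunk"
    ),
}
types = [d.document_type for d in ancestry("doc-1/p3/c0", store)]
```

Here `types` lists the document types in order from the chunk up to the source, matching the vocabulary in the comment block.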