Extract-time provenance (#661)

1. Shared Provenance Module - URI generators, namespace constants,
   triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
   initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits
   parent-child provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance
   edges, forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
   - Link entities to chunks via SUBJECT_OF (not top-level document)
   - Removed duplicate metadata emission (now handled by librarian)
   - Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
   - EntityContext schema has chunk_id field
   - EntityEmbeddings schema has chunk_id field
   - Definitions extractor sets chunk_id when creating EntityContext
   - Graph embeddings processor passes chunk_id through to
     EntityEmbeddings

Provenance Flow:

Document  →  Page (PDF)  →  Chunk      →  Extracted Facts/Embeddings
    ↓            ↓            ↓                      ↓
librarian    librarian    librarian        (chunk_id reference)
+ graph      + graph      + graph

Each artifact is stored in librarian with parent-child linking, and
PROV-O edges are emitted to the knowledge graph for full traceability
from any extracted fact back to its source document.

Also, updating tests
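
The parent-child linking described above can be illustrated with a small sketch. This is not the librarian's code: `DocRecord`, `trace_to_source`, the in-memory `store` and the ID strings are hypothetical stand-ins invented for illustration. Only the `parent_id` and `document_type` fields and the "source" / "page" / "chunk" values come from this commit.

```python
from dataclasses import dataclass


@dataclass
class DocRecord:
    """Pared-down, hypothetical stand-in for DocumentMetadata."""
    id: str
    parent_id: str = ""            # empty for top-level docs, set for children
    document_type: str = "source"  # "source", "page" or "chunk"


def trace_to_source(chunk_id: str, store: dict[str, DocRecord]) -> list[str]:
    """Follow parent_id links from a chunk up to its top-level document."""
    path: list[str] = []
    seen: set[str] = set()
    current = chunk_id
    while current:
        if current in seen:
            # Guard against a malformed store with circular parent links
            raise ValueError(f"parent_id cycle at {current!r}")
        seen.add(current)
        record = store[current]
        path.append(record.id)
        current = record.parent_id  # "" at the top-level document ends the loop
    return path


# Made-up IDs; the real ID/URI scheme is not shown in this section.
store = {
    r.id: r
    for r in [
        DocRecord("doc-1"),
        DocRecord("doc-1/p2", parent_id="doc-1", document_type="page"),
        DocRecord("doc-1/p2/c0", parent_id="doc-1/p2", document_type="chunk"),
    ]
}

lineage = trace_to_source("doc-1/p2/c0", store)
print(" -> ".join(lineage))
```

In this reading, the `chunk_id` carried on `EntityContext` and `EntityEmbeddings` would be the starting point for such a walk, which is what lets an extracted fact or embedding be traced back to its source document.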
2026-05-18 03:45:12 +02:00 · 2026-03-05 18:36:10 +00:00 · cd5580be59
commit cd5580be59
parent d8f0a576af
20 changed files with 1601 additions and 59 deletions
--- a/trustgraph-base/trustgraph/schema/knowledge/document.py
+++ b/trustgraph-base/trustgraph/schema/knowledge/document.py
@@ -34,5 +34,9 @@ class TextDocument:
 class Chunk:
    metadata: Metadata | None = None
    chunk: bytes = b""
+    # For provenance: document_id of this chunk in librarian
+    # Post-chunker optimization: both document_id AND chunk content are included
+    # so downstream processors have the ID for provenance and content to work with
+    document_id: str = ""

 ############################################################################
--- a/trustgraph-base/trustgraph/schema/knowledge/embeddings.py
+++ b/trustgraph-base/trustgraph/schema/knowledge/embeddings.py
@@ -12,6 +12,8 @@ from ..core.topic import topic
 class EntityEmbeddings:
    entity: Term | None = None
    vectors: list[list[float]] = field(default_factory=list)
+    # Provenance: which chunk this embedding was derived from
+    chunk_id: str = ""

 # This is a 'batching' mechanism for the above data
 @dataclass
--- a/trustgraph-base/trustgraph/schema/knowledge/graph.py
+++ b/trustgraph-base/trustgraph/schema/knowledge/graph.py
@@ -12,6 +12,8 @@ from ..core.topic import topic
 class EntityContext:
    entity: Term | None = None
    context: str = ""
+    # Provenance: which chunk this entity context was derived from
+    chunk_id: str = ""

 # This is a 'batching' mechanism for the above data
 @dataclass
--- a/trustgraph-base/trustgraph/schema/services/library.py
+++ b/trustgraph-base/trustgraph/schema/services/library.py
@@ -91,7 +91,12 @@ class DocumentMetadata:
    tags: list[str] = field(default_factory=list)
    # Child document support
    parent_id: str = ""  # Empty for top-level docs, set for children
-    document_type: str = "source"  # "source" or "extracted"
+    # Document type vocabulary:
+    #   "source" - original uploaded document
+    #   "page" - page extracted from source (e.g., PDF page)
+    #   "chunk" - text chunk derived from page or source
+    #   "extracted" - legacy value, kept for backwards compatibility
+    document_type: str = "source"

 @dataclass
 class ProcessingMetadata: