trustgraph/trustgraph-base/trustgraph/schema/knowledge/document.py
cybermaggedon cd5580be59
Extract-time provenance (#661)
1. Shared Provenance Module - URI generators, namespace constants,
   triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
   initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
   provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
   forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
   - Link entities to chunks via SUBJECT_OF (not top-level document)
   - Removed duplicate metadata emission (now handled by librarian)
   - Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
   - EntityContext schema has chunk_id field
   - EntityEmbeddings schema has chunk_id field
   - Definitions extractor sets chunk_id when creating EntityContext
   - Graph embeddings processor passes chunk_id through to
     EntityEmbeddings
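The chunk_id pass-through in item 6 can be sketched as below. These dataclasses and helpers are illustrative stand-ins, not the actual TrustGraph schemas; only the chunk_id plumbing reflects the description above.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for the real schemas; field layouts are assumed.
@dataclass
class Chunk:
    chunk: bytes = b""
    document_id: str = ""   # librarian ID of this chunk

@dataclass
class EntityContext:
    entity: str = ""
    context: str = ""
    chunk_id: str = ""      # set by the definitions extractor

@dataclass
class EntityEmbeddings:
    entity: str = ""
    vectors: list = field(default_factory=list)
    chunk_id: str = ""      # passed through by the embeddings processor

def make_entity_context(entity: str, context: str, chunk: Chunk) -> EntityContext:
    # Copy the chunk's librarian document_id so embeddings built from
    # this context inherit chunk-level provenance.
    return EntityContext(entity=entity, context=context,
                         chunk_id=chunk.document_id)

def embed(ec: EntityContext, vectors: list) -> EntityEmbeddings:
    # chunk_id flows through unchanged to the embeddings record.
    return EntityEmbeddings(entity=ec.entity, vectors=vectors,
                            chunk_id=ec.chunk_id)
```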

Provenance Flow:
Document → Page (PDF) → Chunk → Extracted Facts/Embeddings
    ↓           ↓          ↓              ↓
  librarian  librarian  librarian    (chunk_id reference)
  + graph    + graph    + graph

Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.
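The traceability chain can be sketched with plain triples. The URI scheme and helper names below are invented for illustration; only `prov:wasDerivedFrom` is the real PROV-O term.

```python
# Minimal sketch of walking provenance edges back to the source document.
# Triples are plain (subject, predicate, object) tuples; URIs are assumed.
PROV = "http://www.w3.org/ns/prov#"

def derived_from(child_uri: str, parent_uri: str) -> tuple:
    # One PROV-O derivation edge: child was derived from parent.
    return (child_uri, PROV + "wasDerivedFrom", parent_uri)

doc   = "trustgraph://doc/123"         # hypothetical URI scheme
page  = "trustgraph://doc/123/p/1"
chunk = "trustgraph://doc/123/p/1/c/0"

edges = [
    derived_from(page, doc),     # emitted by PDF extractor
    derived_from(chunk, page),   # emitted by chunker
]

def trace_to_source(uri: str, edges: list) -> str:
    # Follow wasDerivedFrom edges until the root document is reached.
    parents = {s: o for s, p, o in edges}
    while uri in parents:
        uri = parents[uri]
    return uri
```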

Also updates tests.
2026-03-05 18:36:10 +00:00


from dataclasses import dataclass
from ..core.metadata import Metadata
from ..core.topic import topic
############################################################################

# PDF docs etc.

@dataclass
class Document:
    metadata: Metadata | None = None
    data: bytes = b""

    # For large document streaming: if document_id is set, the receiver should
    # fetch content from librarian instead of using inline data
    document_id: str = ""

############################################################################

# Text documents / text from PDF

@dataclass
class TextDocument:
    metadata: Metadata | None = None
    text: bytes = b""

    # For large document streaming: if document_id is set, the receiver should
    # fetch content from librarian instead of using inline text
    document_id: str = ""

############################################################################

# Chunks of text

@dataclass
class Chunk:
    metadata: Metadata | None = None
    chunk: bytes = b""

    # For provenance: document_id of this chunk in librarian.
    # Post-chunker optimization: both document_id AND chunk content are
    # included so downstream processors have the ID for provenance and
    # content to work with
    document_id: str = ""

############################################################################
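The large-document streaming convention described in the comments above could be handled on the receiving side roughly as follows. `fetch_from_librarian` is a hypothetical stand-in for the real librarian client, and the simplified `Document` here is for illustration only.

```python
from dataclasses import dataclass

# Simplified illustration of the schema above.
@dataclass
class Document:
    data: bytes = b""
    document_id: str = ""

def fetch_from_librarian(document_id: str) -> bytes:
    # Hypothetical stand-in for the real librarian fetch call;
    # here backed by an in-memory dict for demonstration.
    store = {"doc-42": b"full document content"}
    return store[document_id]

def resolve_content(doc: Document) -> bytes:
    # Streaming convention: a set document_id means the content lives
    # in the librarian; otherwise the inline bytes are authoritative.
    if doc.document_id:
        return fetch_from_librarian(doc.document_id)
    return doc.data
```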