Extraction Flows
This document describes how data flows through the TrustGraph extraction pipeline, from document submission through to storage in knowledge stores.
Overview
```
┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
│          │     │ (PDF only)  │     │         │     │ Extraction         │
│          │────────────────────────▶│         │     │                    │
└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
      │                                                        │
      │                                                        ├──▶ Triples
      │                                                        ├──▶ Entity Contexts
      │                                                        └──▶ Rows
      │
      └──▶ Document Embeddings
```
Content Storage
Blob Storage (S3/Minio)
Document content is stored in S3-compatible blob storage:
- Path format: `doc/{object_id}`, where `object_id` is a UUID
- All document types stored here: source documents, pages, chunks
Metadata Storage (Cassandra)
Document metadata stored in Cassandra includes:
- Document ID, title, kind (MIME type)
- `object_id` reference to blob storage
- `parent_id` for child documents (pages, chunks)
- `document_type`: "source", "page", "chunk", "answer"
Inline vs Streaming Threshold
Content transmission uses a size-based strategy:
- < 2MB: Content included inline in message (base64-encoded)
- ≥ 2MB: Only `document_id` sent; processor fetches via librarian API
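The threshold decision above can be sketched as follows. Only the 2 MB cutoff and the inline/reference distinction come from this document; the function name, payload field names, and base64 encoding detail are illustrative assumptions:

```python
# Sketch of the size-based inline/streaming decision (illustrative only;
# the 2 MB threshold is from the doc, names are assumptions).
import base64

INLINE_THRESHOLD = 2 * 1024 * 1024  # 2 MB

def build_payload(content: bytes, document_id: str) -> dict:
    """Return a message payload, inlining small content as base64."""
    if len(content) < INLINE_THRESHOLD:
        return {"data": base64.b64encode(content).decode("ascii")}
    # Large content: send only the reference; the processor fetches
    # the bytes from the librarian API on demand.
    return {"document_id": document_id}
```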
Stage 1: Document Submission (Librarian)
Entry Point
Documents enter the system via librarian's add-document operation:
- Content uploaded to blob storage
- Metadata record created in Cassandra
- Returns document ID
Triggering Extraction
The add-processing operation triggers extraction:
- Specifies `document_id`, `flow` (pipeline ID), `collection` (target store)
- Librarian's `load_document()` fetches content and publishes to flow input queue
Schema: Document
```
Document
├── metadata: Metadata
│   ├── id: str                 # Document identifier
│   ├── user: str               # Tenant/user ID
│   ├── collection: str         # Target collection
│   └── metadata: list[Triple]  # (largely unused, historical)
├── data: bytes                 # PDF content (base64, if inline)
└── document_id: str            # Librarian reference (if streaming)
```
Routing: Based on kind field:
- `application/pdf` → `document-load` queue → PDF Decoder
- `text/plain` → `text-load` queue → Chunker
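The dispatch reduces to a lookup on the MIME type. A minimal sketch, assuming a simple table (the queue names come from the doc; the function and error handling are illustrative):

```python
# Kind-based routing table; unknown kinds are rejected up front.
ROUTES = {
    "application/pdf": "document-load",  # → PDF Decoder
    "text/plain": "text-load",           # → Chunker
}

def route(kind: str) -> str:
    """Return the flow input queue for a document kind."""
    try:
        return ROUTES[kind]
    except KeyError:
        raise ValueError(f"unsupported document kind: {kind}")
```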
Stage 2: PDF Decoder
Converts PDF documents into text pages.
Process
- Fetch content (inline `data` or via `document_id` from librarian)
- Extract pages using PyPDF
- For each page:
  - Save as child document in librarian (`{doc_id}/p{page_num}`)
  - Emit provenance triples (page derived from document)
  - Forward to chunker
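The per-page bookkeeping can be sketched as a pure helper that derives the child-document ID and a provenance triple linking page to source. The ID format is from this document; the PROV-O predicate is an assumption for illustration, not necessarily the exact IRI TrustGraph emits:

```python
# Derive a page's child-document ID and its provenance triple.
# Predicate IRI (PROV-O wasDerivedFrom) is an illustrative assumption.
DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"

def page_provenance(doc_id: str, page_num: int) -> tuple[str, tuple[str, str, str]]:
    page_id = f"{doc_id}/p{page_num}"
    triple = (page_id, DERIVED_FROM, doc_id)  # (s, p, o)
    return page_id, triple
```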
Schema: TextDocument
```
TextDocument
├── metadata: Metadata
│   ├── id: str                 # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── text: bytes                 # Page text content (if inline)
└── document_id: str            # Librarian reference (e.g., "doc123/p1")
```
Stage 3: Chunker
Splits text into chunks at configured size.
Parameters (flow-configurable)
- `chunk_size`: Target chunk size in characters (default: 2000)
- `chunk_overlap`: Overlap between chunks (default: 100)
Process
- Fetch text content (inline or via librarian)
- Split using recursive character splitter
- For each chunk:
  - Save as child document in librarian (`{parent_id}/c{index}`)
  - Emit provenance triples (chunk derived from page/document)
  - Forward to extraction processors
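The actual chunker is a recursive character splitter; the size and overlap semantics can still be illustrated with a naive fixed-window sketch (parameter defaults from the doc; everything else is a simplification):

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 100) -> list[str]:
    """Naive fixed-size splitter with overlap. The real chunker splits
    recursively on separators; this only illustrates size/overlap."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment fully contained in the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) <= chunk_overlap:
        chunks.pop()
    return chunks
```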
Schema: Chunk
```
Chunk
├── metadata: Metadata
│   ├── id: str                 # Chunk URI
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── chunk: bytes                # Chunk text content
└── document_id: str            # Librarian chunk ID (e.g., "doc123/p1/c3")
```
Document ID Hierarchy
Child documents encode their lineage in the ID:
- Source: `doc123`
- Page: `doc123/p5`
- Chunk from page: `doc123/p5/c2`
- Chunk from text: `doc123/c2`
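Because lineage is encoded in the ID itself, the ancestor chain can be recovered by splitting on `/`. A small illustrative helper (not part of the codebase) following the examples above:

```python
def lineage(doc_id: str) -> list[str]:
    """Return the chain of ancestor IDs, nearest parent first."""
    parts = doc_id.split("/")
    return ["/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]
```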
Stage 4: Knowledge Extraction
Multiple extraction patterns available, selected by flow configuration.
Pattern A: Basic GraphRAG
Two parallel processors:
kg-extract-definitions
- Input: Chunk
- Output: Triples (entity definitions), EntityContexts
- Extracts: entity labels, definitions
kg-extract-relationships
- Input: Chunk
- Output: Triples (relationships), EntityContexts
- Extracts: subject-predicate-object relationships
Pattern B: Ontology-Driven (kg-extract-ontology)
- Input: Chunk
- Output: Triples, EntityContexts
- Uses configured ontology to guide extraction
Pattern C: Agent-Based (kg-extract-agent)
- Input: Chunk
- Output: Triples, EntityContexts
- Uses agent framework for extraction
Pattern D: Row Extraction (kg-extract-rows)
- Input: Chunk
- Output: Rows (structured data, not triples)
- Uses schema definition to extract structured records
Schema: Triples
```
Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]  # (set to [] by extractors)
└── triples: list[Triple]
    └── Triple
        ├── s: Term        # Subject
        ├── p: Term        # Predicate
        ├── o: Term        # Object
        └── g: str | None  # Named graph
```
Schema: EntityContexts
```
EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term   # Entity identifier (IRI)
        ├── context: str   # Textual description for embedding
        └── chunk_id: str  # Source chunk ID (provenance)
```
Schema: Rows
```
Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]]  # Extracted records
```
Stage 5: Embeddings Generation
Graph Embeddings
Converts entity contexts into vector embeddings.
Process:
- Receive EntityContexts
- Call embeddings service with context text
- Output GraphEmbeddings (entity → vector mapping)
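The transformation in those three steps can be sketched with the embeddings call injected as a function, so any backing service fits. All names here are illustrative assumptions; only the field names and the chunk_id pass-through come from the schemas in this document:

```python
# Turn EntityContexts into GraphEmbeddings-shaped records.
from typing import Callable

def embed_entity_contexts(
    entities: list[dict],                 # [{"entity", "context", "chunk_id"}]
    embed: Callable[[str], list[float]],  # embeddings service call
) -> list[dict]:
    return [
        {
            "entity": e["entity"],
            "vector": embed(e["context"]),
            "chunk_id": e["chunk_id"],    # provenance carried through
        }
        for e in entities
    ]
```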
Schema: GraphEmbeddings
```
GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term         # Entity identifier
        ├── vector: list[float]  # Embedding vector
        └── chunk_id: str        # Source chunk (provenance)
```
Document Embeddings
Converts chunk text directly into vector embeddings.
Process:
- Receive Chunk
- Call embeddings service with chunk text
- Output DocumentEmbeddings
Schema: DocumentEmbeddings
```
DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str        # Chunk identifier
        └── vector: list[float]  # Embedding vector
```
Row Embeddings
Converts row index fields into vector embeddings.
Process:
- Receive Rows
- Embed configured index fields
- Output to row vector store
Stage 6: Storage
Triple Store
- Receives: Triples
- Storage: Cassandra (entity-centric tables)
- Named graphs separate core knowledge from provenance:
- `""` (default): Core knowledge facts
- `urn:graph:source`: Extraction provenance
- `urn:graph:retrieval`: Query-time explainability
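The named-graph convention above amounts to bucketing triples by their `g` component before writing. A minimal sketch, assuming triples arrive as `(s, p, o, g)` tuples (representation and function name are illustrative):

```python
# Separate core facts from provenance/retrieval triples by named graph.
from collections import defaultdict

def partition_by_graph(triples):
    buckets = defaultdict(list)
    for s, p, o, g in triples:
        buckets[g or ""].append((s, p, o))  # None and "" both mean core
    return dict(buckets)
```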
Vector Store (Graph Embeddings)
- Receives: GraphEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: entity IRI
- Metadata: chunk_id for provenance
Vector Store (Document Embeddings)
- Receives: DocumentEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: chunk_id
Row Store
- Receives: Rows
- Storage: Cassandra
- Schema-driven table structure
Row Vector Store
- Receives: Row embeddings
- Storage: Vector DB
- Indexed by: row index fields
Metadata Field Analysis
Actively Used Fields
| Field | Usage |
|---|---|
| `metadata.id` | Document/chunk identifier, logging, provenance |
| `metadata.user` | Multi-tenancy, storage routing |
| `metadata.collection` | Target collection selection |
| `document_id` | Librarian reference, provenance linking |
| `chunk_id` | Provenance tracking through pipeline |
Potentially Redundant Fields
| Field | Status |
|---|---|
| `metadata.metadata` | Set to [] by all extractors; document-level metadata now handled by librarian at submission time |
Bytes Fields Pattern
All content fields (`data`, `text`, `chunk`) are typed as bytes but are immediately decoded to UTF-8 strings by every processor; no processor operates on the raw bytes.
Flow Configuration
Flows are defined externally and provided to librarian via config service. Each flow specifies:
- Input queues (`text-load`, `document-load`)
- Processor chain
- Parameters (chunk size, extraction method, etc.)
Example flow patterns:
- `pdf-graphrag`: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
- `text-graphrag`: Text → Chunker → Definitions + Relationships → Embeddings
- `pdf-ontology`: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
- `text-rows`: Text → Chunker → Row Extraction → Row Store
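To make the shape of a flow definition concrete, here is an illustrative rendering of the `pdf-graphrag` pattern as a Python dict. The real config-service format almost certainly differs in field names and structure; the processor and queue names are taken from earlier sections, the rest is assumption:

```python
# Hypothetical flow definition sketch for the pdf-graphrag pattern.
PDF_GRAPHRAG_FLOW = {
    "id": "pdf-graphrag",
    "input_queue": "document-load",
    "processors": [
        "pdf-decoder",
        "chunker",
        "kg-extract-definitions",
        "kg-extract-relationships",
        "graph-embeddings",
    ],
    "parameters": {"chunk_size": 2000, "chunk_overlap": 100},
}
```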