---
layout: default
title: Extraction Flows
parent: Tech Specs
---

# Extraction Flows
This document describes how data flows through the TrustGraph extraction pipeline, from document submission through to storage in knowledge stores.
## Overview
```
┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
│          │     │ (PDF only)  │     │         │     │ Extraction         │
│          │────────────────────────▶│         │     │                    │
└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
      │                                   │
      │                                   ├──▶ Triples
      │                                   ├──▶ Entity Contexts
      │                                   └──▶ Rows
      │
      └──▶ Document Embeddings
```
## Content Storage

### Blob Storage (S3/MinIO)
Document content is stored in S3-compatible blob storage:
- Path format: `doc/{object_id}`, where `object_id` is a UUID
- All document types stored here: source documents, pages, chunks
### Metadata Storage (Cassandra)
Document metadata stored in Cassandra includes:
- Document ID, title, kind (MIME type)
- `object_id` reference to blob storage
- `parent_id` for child documents (pages, chunks)
- `document_type`: "source", "page", "chunk", "answer"
### Inline vs Streaming Threshold
Content transmission uses a size-based strategy:
- < 2 MB: Content included inline in the message (base64-encoded)
- ≥ 2 MB: Only the `document_id` is sent; the processor fetches the content via the librarian API
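The threshold logic above can be sketched as follows; the 2 MB constant matches the document, but the message shape itself is illustrative, not the actual TrustGraph wire format:

```python
import base64

INLINE_THRESHOLD = 2 * 1024 * 1024  # 2 MB

def build_message(document_id: str, content: bytes) -> dict:
    """Inline small payloads; send only a reference for large ones."""
    if len(content) < INLINE_THRESHOLD:
        return {
            "document_id": document_id,
            # Small content travels with the message, base64-encoded.
            "data": base64.b64encode(content).decode("ascii"),
        }
    # Large content: the processor fetches it from the librarian API.
    return {"document_id": document_id}
```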
## Stage 1: Document Submission (Librarian)

### Entry Point
Documents enter the system via librarian's add-document operation:
- Content uploaded to blob storage
- Metadata record created in Cassandra
- Returns document ID
### Triggering Extraction
The add-processing operation triggers extraction:
- Specifies `document_id`, `flow` (pipeline ID), and `collection` (target store)
- Librarian's `load_document()` fetches the content and publishes it to the flow input queue
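The two-step submission flow can be illustrated with a stub client; `LibrarianClient` and its method signatures are hypothetical stand-ins for the real librarian API, included only to show the add-document / add-processing sequence:

```python
import uuid

class LibrarianClient:
    """Hypothetical stub; not the real TrustGraph client."""

    def __init__(self):
        self.documents = {}
        self.processing = []

    def add_document(self, content: bytes, title: str, kind: str) -> str:
        """Upload content and create a metadata record; return a document ID."""
        doc_id = str(uuid.uuid4())
        self.documents[doc_id] = {"title": title, "kind": kind, "content": content}
        return doc_id

    def add_processing(self, document_id: str, flow: str, collection: str) -> None:
        """Trigger extraction: the librarian fetches the content and
        publishes it to the flow's input queue."""
        self.processing.append(
            {"document_id": document_id, "flow": flow, "collection": collection}
        )

client = LibrarianClient()
doc_id = client.add_document(b"%PDF-1.4 ...", "Report", "application/pdf")
client.add_processing(doc_id, flow="pdf-graphrag", collection="default")
```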
### Schema: Document
```
Document
├── metadata: Metadata
│   ├── id: str                  # Document identifier
│   ├── user: str                # Tenant/user ID
│   ├── collection: str          # Target collection
│   └── metadata: list[Triple]   # (largely unused, historical)
├── data: bytes                  # PDF content (base64, if inline)
└── document_id: str             # Librarian reference (if streaming)
```
Routing is based on the `kind` field:

- `application/pdf` → `document-load` queue → PDF Decoder
- `text/plain` → `text-load` queue → Chunker
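A minimal sketch of this routing rule; the queue names come from this document, while the `route` helper itself is illustrative:

```python
# MIME kind → flow input queue, per the routing table above.
ROUTES = {
    "application/pdf": "document-load",  # → PDF Decoder
    "text/plain": "text-load",           # → Chunker
}

def route(kind: str) -> str:
    try:
        return ROUTES[kind]
    except KeyError:
        raise ValueError(f"unsupported document kind: {kind}")
```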
## Stage 2: PDF Decoder

Converts PDF documents into text pages.

### Process
1. Fetch content (inline `data` or via `document_id` from librarian)
2. Extract pages using PyPDF
3. For each page:
   - Save as child document in librarian (`{doc_id}/p{page_num}`)
   - Emit provenance triples (page derived from document)
   - Forward to chunker
### Schema: TextDocument
```
TextDocument
├── metadata: Metadata
│   ├── id: str                # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── text: bytes                # Page text content (if inline)
└── document_id: str           # Librarian reference (e.g., "doc123/p1")
```
## Stage 3: Chunker

Splits text into chunks at the configured size.

### Parameters (flow-configurable)
- `chunk_size`: Target chunk size in characters (default: 2000)
- `chunk_overlap`: Overlap between chunks in characters (default: 100)
### Process
1. Fetch text content (inline or via librarian)
2. Split using a recursive character splitter
3. For each chunk:
   - Save as child document in librarian (`{parent_id}/c{index}`)
   - Emit provenance triples (chunk derived from page/document)
   - Forward to extraction processors
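The size/overlap behaviour can be approximated with a simple sliding window; the real chunker uses a recursive character splitter that prefers natural boundaries (paragraphs, sentences), so this is only a sketch of the defaults above:

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters, each overlapping
    the previous one by chunk_overlap characters. A real recursive
    character splitter would prefer paragraph/sentence boundaries."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is entirely covered by the overlap.
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```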
### Schema: Chunk
```
Chunk
├── metadata: Metadata
│   ├── id: str                # Chunk URI
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── chunk: bytes               # Chunk text content
└── document_id: str           # Librarian chunk ID (e.g., "doc123/p1/c3")
```
## Document ID Hierarchy
Child documents encode their lineage in the ID:
- Source: `doc123`
- Page: `doc123/p5`
- Chunk from page: `doc123/p5/c2`
- Chunk from text: `doc123/c2`
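Helpers illustrating this lineage encoding; the `p{n}`/`c{n}` grammar follows the examples above, but the functions themselves are hypothetical:

```python
def page_id(doc_id: str, page_num: int) -> str:
    """ID for a page extracted from a source document."""
    return f"{doc_id}/p{page_num}"

def chunk_id(parent_id: str, index: int) -> str:
    """ID for a chunk; the parent is either a page or the source document."""
    return f"{parent_id}/c{index}"

def source_of(child_id: str) -> str:
    """Recover the root source document ID from any child ID."""
    return child_id.split("/", 1)[0]
```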
## Stage 4: Knowledge Extraction

Multiple extraction patterns are available, selected by flow configuration.

### Pattern A: Basic GraphRAG
Two parallel processors:
#### kg-extract-definitions
- Input: Chunk
- Output: Triples (entity definitions), EntityContexts
- Extracts: entity labels, definitions
#### kg-extract-relationships
- Input: Chunk
- Output: Triples (relationships), EntityContexts
- Extracts: subject-predicate-object relationships
### Pattern B: Ontology-Driven (kg-extract-ontology)
- Input: Chunk
- Output: Triples, EntityContexts
- Uses configured ontology to guide extraction
### Pattern C: Agent-Based (kg-extract-agent)
- Input: Chunk
- Output: Triples, EntityContexts
- Uses agent framework for extraction
### Pattern D: Row Extraction (kg-extract-rows)
- Input: Chunk
- Output: Rows (structured data, not triples)
- Uses schema definition to extract structured records
### Schema: Triples
```
Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]   # (set to [] by extractors)
└── triples: list[Triple]
    └── Triple
        ├── s: Term         # Subject
        ├── p: Term         # Predicate
        ├── o: Term         # Object
        └── g: str | None   # Named graph
```
### Schema: EntityContexts
```
EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term    # Entity identifier (IRI)
        ├── context: str    # Textual description for embedding
        └── chunk_id: str   # Source chunk ID (provenance)
```
### Schema: Rows
```
Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]]   # Extracted records
```
## Stage 5: Embeddings Generation

### Graph Embeddings
Converts entity contexts into vector embeddings.
Process:
- Receive EntityContexts
- Call embeddings service with context text
- Output GraphEmbeddings (entity → vector mapping)
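A sketch of the EntityContexts → GraphEmbeddings step; `embed()` is a placeholder for the real embeddings service call, and the dict shapes mirror (but simplify) the schemas in this document:

```python
def embed(text: str) -> list[float]:
    """Placeholder: a real deployment calls the embeddings service here."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def to_graph_embeddings(entity_contexts: list[dict]) -> list[dict]:
    """Map each entity context to an entity → vector record."""
    return [
        {
            "entity": ec["entity"],
            "vector": embed(ec["context"]),
            "chunk_id": ec["chunk_id"],  # provenance preserved end-to-end
        }
        for ec in entity_contexts
    ]
```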
### Schema: GraphEmbeddings
```
GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term          # Entity identifier
        ├── vector: list[float]   # Embedding vector
        └── chunk_id: str         # Source chunk (provenance)
```
### Document Embeddings
Converts chunk text directly into vector embeddings.
Process:
- Receive Chunk
- Call embeddings service with chunk text
- Output DocumentEmbeddings
### Schema: DocumentEmbeddings
```
DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str         # Chunk identifier
        └── vector: list[float]   # Embedding vector
```
### Row Embeddings
Converts row index fields into vector embeddings.
Process:
- Receive Rows
- Embed configured index fields
- Output to row vector store
## Stage 6: Storage

### Triple Store
- Receives: Triples
- Storage: Cassandra (entity-centric tables)
- Named graphs separate core knowledge from provenance:
""(default): Core knowledge factsurn:graph:source: Extraction provenanceurn:graph:retrieval: Query-time explainability
### Vector Store (Graph Embeddings)
- Receives: GraphEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: entity IRI
- Metadata: chunk_id for provenance
### Vector Store (Document Embeddings)
- Receives: DocumentEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: chunk_id
### Row Store
- Receives: Rows
- Storage: Cassandra
- Schema-driven table structure
### Row Vector Store
- Receives: Row embeddings
- Storage: Vector DB
- Indexed by: row index fields
## Metadata Field Analysis

### Actively Used Fields
| Field | Usage |
|---|---|
| `metadata.id` | Document/chunk identifier, logging, provenance |
| `metadata.user` | Multi-tenancy, storage routing |
| `metadata.collection` | Target collection selection |
| `document_id` | Librarian reference, provenance linking |
| `chunk_id` | Provenance tracking through pipeline |
### Potentially Redundant Fields

| Field | Status |
|---|---|
| `metadata.metadata` | Set to `[]` by all extractors; document-level metadata is now handled by the librarian at submission time |
## Bytes Fields Pattern

All content fields (`data`, `text`, `chunk`) are typed as `bytes` but are immediately decoded to UTF-8 strings by every processor; no processor uses the raw bytes.
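For illustration, the pattern amounts to an immediate round trip:

```python
# A chunk field arrives as bytes but is used as text right away.
chunk: bytes = "Example chunk text.".encode("utf-8")
text = chunk.decode("utf-8")  # all downstream use is on the str
```

A `str`-typed field would avoid the decode step entirely, which is why this pattern is flagged here.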
## Flow Configuration
Flows are defined externally and provided to librarian via config service. Each flow specifies:
- Input queues (`text-load`, `document-load`)
- Processor chain
- Parameters (chunk size, extraction method, etc.)
Example flow patterns:
- `pdf-graphrag`: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
- `text-graphrag`: Text → Chunker → Definitions + Relationships → Embeddings
- `pdf-ontology`: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
- `text-rows`: Text → Chunker → Row Extraction → Row Store
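A hypothetical flow definition for the first pattern, based only on the fields this document says a flow specifies (input queues, processor chain, parameters); the key names are illustrative, not the real config schema:

```python
# Illustrative flow definition shape; not the actual TrustGraph config format.
pdf_graphrag_flow = {
    "id": "pdf-graphrag",
    "input_queues": ["document-load"],
    "processors": [
        "pdf-decoder",
        "chunker",
        "kg-extract-definitions",
        "kg-extract-relationships",
        "graph-embeddings",
    ],
    "parameters": {
        "chunk_size": 2000,     # default from the Chunker stage
        "chunk_overlap": 100,   # default from the Chunker stage
    },
}
```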