---
layout: default
title: "Extraction Flows"
parent: "Tech Specs"
---

# Extraction Flows

This document describes how data flows through the TrustGraph extraction pipeline, from document submission through to storage in knowledge stores.

## Overview

```
┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
│          │     │ (PDF only)  │     │         │     │ Extraction         │
│          │────────────────────────▶│         │     │                    │
└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
                                          │                    │
                                          │                    ├──▶ Triples
                                          │                    ├──▶ Entity Contexts
                                          │                    └──▶ Rows
                                          │
                                          └──▶ Document Embeddings
```
## Content Storage

### Blob Storage (S3/Minio)

Document content is stored in S3-compatible blob storage:

- Path format: `doc/{object_id}` where `object_id` is a UUID
- All document types stored here: source documents, pages, chunks

### Metadata Storage (Cassandra)

Document metadata stored in Cassandra includes:

- Document ID, title, kind (MIME type)
- `object_id` reference to blob storage
- `parent_id` for child documents (pages, chunks)
- `document_type`: "source", "page", "chunk", "answer"

### Inline vs Streaming Threshold

Content transmission uses a size-based strategy:

- **< 2MB**: Content included inline in message (base64-encoded)
- **≥ 2MB**: Only `document_id` sent; processor fetches via librarian API
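The size-based strategy can be sketched as follows. This is illustrative only: `INLINE_LIMIT`, `make_message`, and the message field shapes are assumptions, not the actual implementation.

```python
import base64

# Assumed threshold from the spec above: 2 MB.
INLINE_LIMIT = 2 * 1024 * 1024

def make_message(content: bytes, document_id: str) -> dict:
    """Build a flow message, inlining small content and sending
    only a librarian reference for large content."""
    if len(content) < INLINE_LIMIT:
        # Small documents travel inside the message itself, base64-encoded.
        return {
            "document_id": document_id,
            "data": base64.b64encode(content).decode("ascii"),
        }
    # Large documents: only the reference is sent; the processor
    # fetches the content via the librarian API using document_id.
    return {"document_id": document_id, "data": None}
```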
## Stage 1: Document Submission (Librarian)

### Entry Point

Documents enter the system via librarian's `add-document` operation:

1. Content uploaded to blob storage
2. Metadata record created in Cassandra
3. Returns document ID

### Triggering Extraction

The `add-processing` operation triggers extraction:

- Specifies `document_id`, `flow` (pipeline ID), `collection` (target store)
- Librarian's `load_document()` fetches content and publishes to the flow input queue

### Schema: Document

```
Document
├── metadata: Metadata
│   ├── id: str                 # Document identifier
│   ├── user: str               # Tenant/user ID
│   ├── collection: str         # Target collection
│   └── metadata: list[Triple]  # (largely unused, historical)
├── data: bytes                 # PDF content (base64, if inline)
└── document_id: str            # Librarian reference (if streaming)
```

**Routing**: Based on `kind` field:

- `application/pdf` → `document-load` queue → PDF Decoder
- `text/plain` → `text-load` queue → Chunker
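The routing rule above amounts to a dispatch table. The queue names come from the spec; the helper function itself is illustrative:

```python
# Flow input queue per MIME kind, as described above.
ROUTING = {
    "application/pdf": "document-load",  # → PDF Decoder
    "text/plain": "text-load",           # → Chunker
}

def route(kind: str) -> str:
    """Return the flow input queue for a document's MIME kind."""
    try:
        return ROUTING[kind]
    except KeyError:
        raise ValueError(f"unsupported document kind: {kind}")
```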
## Stage 2: PDF Decoder

Converts PDF documents into text pages.

### Process

1. Fetch content (inline `data` or via `document_id` from librarian)
2. Extract pages using PyPDF
3. For each page:
   - Save as child document in librarian (`{doc_id}/p{page_num}`)
   - Emit provenance triples (page derived from document)
   - Forward to chunker

### Schema: TextDocument

```
TextDocument
├── metadata: Metadata
│   ├── id: str          # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── text: bytes          # Page text content (if inline)
└── document_id: str     # Librarian reference (e.g., "doc123/p1")
```
## Stage 3: Chunker

Splits text into chunks at the configured size.

### Parameters (flow-configurable)

- `chunk_size`: Target chunk size in characters (default: 2000)
- `chunk_overlap`: Overlap between chunks (default: 100)

### Process

1. Fetch text content (inline or via librarian)
2. Split using recursive character splitter
3. For each chunk:
   - Save as child document in librarian (`{parent_id}/c{index}`)
   - Emit provenance triples (chunk derived from page/document)
   - Forward to extraction processors
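A minimal sliding-window sketch of the chunking behaviour. The actual pipeline uses a recursive character splitter, which also prefers splitting at separator boundaries; this only illustrates how `chunk_size` and `chunk_overlap` interact:

```python
def split_text(text: str, chunk_size: int = 2000,
               chunk_overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    repeating chunk_overlap characters between neighbours."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap  # advance per chunk
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```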
### Schema: Chunk

```
Chunk
├── metadata: Metadata
│   ├── id: str          # Chunk URI
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── chunk: bytes         # Chunk text content
└── document_id: str     # Librarian chunk ID (e.g., "doc123/p1/c3")
```

### Document ID Hierarchy

Child documents encode their lineage in the ID:

- Source: `doc123`
- Page: `doc123/p5`
- Chunk from page: `doc123/p5/c2`
- Chunk from text: `doc123/c2`
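Because lineage is encoded positionally, the ancestor chain can be recovered by splitting on `/`. The helper below is hypothetical, for illustration only, not a librarian API:

```python
def ancestors(document_id: str) -> list[str]:
    """Return the chain of ancestor IDs for a child document,
    outermost first: "doc123/p5/c2" → ["doc123", "doc123/p5"]."""
    parts = document_id.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]
```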
## Stage 4: Knowledge Extraction

Multiple extraction patterns are available, selected by flow configuration.

### Pattern A: Basic GraphRAG

Two parallel processors:

**kg-extract-definitions**

- Input: Chunk
- Output: Triples (entity definitions), EntityContexts
- Extracts: entity labels, definitions

**kg-extract-relationships**

- Input: Chunk
- Output: Triples (relationships), EntityContexts
- Extracts: subject-predicate-object relationships

### Pattern B: Ontology-Driven (kg-extract-ontology)

- Input: Chunk
- Output: Triples, EntityContexts
- Uses configured ontology to guide extraction

### Pattern C: Agent-Based (kg-extract-agent)

- Input: Chunk
- Output: Triples, EntityContexts
- Uses agent framework for extraction

### Pattern D: Row Extraction (kg-extract-rows)

- Input: Chunk
- Output: Rows (structured data, not triples)
- Uses schema definition to extract structured records

### Schema: Triples

```
Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]  # (set to [] by extractors)
└── triples: list[Triple]
    └── Triple
        ├── s: Term        # Subject
        ├── p: Term        # Predicate
        ├── o: Term        # Object
        └── g: str | None  # Named graph
```
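A minimal Python sketch of the Triple shape above. Field names follow the schema; representing `Term` as a plain string is a simplification (the real type distinguishes IRIs from literals):

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for the pipeline's Term type.
Term = str

@dataclass
class Triple:
    s: Term                  # Subject
    p: Term                  # Predicate
    o: Term                  # Object
    g: Optional[str] = None  # Named graph (default graph = core knowledge)

# Example: an rdfs:label fact, as an extractor might emit it.
t = Triple(
    s="http://trustgraph.ai/e/alan-turing",
    p="http://www.w3.org/2000/01/rdf-schema#label",
    o="Alan Turing",
)
```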
### Schema: EntityContexts

```
EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term   # Entity identifier (IRI)
        ├── context: str   # Textual description for embedding
        └── chunk_id: str  # Source chunk ID (provenance)
```

### Schema: Rows

```
Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]]  # Extracted records
```
## Stage 5: Embeddings Generation

### Graph Embeddings

Converts entity contexts into vector embeddings.

**Process:**

1. Receive EntityContexts
2. Call embeddings service with context text
3. Output GraphEmbeddings (entity → vector mapping)
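The three steps above can be sketched as follows. `embed` is a stub standing in for the embeddings service call, and the dict shapes are illustrative, not the wire format:

```python
def embed(text: str) -> list[float]:
    # Stub: the real pipeline calls the configured embeddings service.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

def graph_embeddings(entity_contexts: list[dict]) -> list[dict]:
    """Map each EntityContext to an EntityEmbeddings-like record,
    preserving chunk_id for provenance."""
    return [
        {
            "entity": ec["entity"],
            "vector": embed(ec["context"]),   # embed the context text
            "chunk_id": ec["chunk_id"],       # provenance passes through
        }
        for ec in entity_contexts
    ]
```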
**Schema: GraphEmbeddings**

```
GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term         # Entity identifier
        ├── vector: list[float]  # Embedding vector
        └── chunk_id: str        # Source chunk (provenance)
```

### Document Embeddings

Converts chunk text directly into vector embeddings.

**Process:**

1. Receive Chunk
2. Call embeddings service with chunk text
3. Output DocumentEmbeddings

**Schema: DocumentEmbeddings**

```
DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str        # Chunk identifier
        └── vector: list[float]  # Embedding vector
```

### Row Embeddings

Converts row index fields into vector embeddings.

**Process:**

1. Receive Rows
2. Embed configured index fields
3. Output to row vector store
## Stage 6: Storage

### Triple Store

- Receives: Triples
- Storage: Cassandra (entity-centric tables)
- Named graphs separate core knowledge from provenance:
  - `""` (default): Core knowledge facts
  - `urn:graph:source`: Extraction provenance
  - `urn:graph:retrieval`: Query-time explainability

### Vector Store (Graph Embeddings)

- Receives: GraphEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: entity IRI
- Metadata: chunk_id for provenance

### Vector Store (Document Embeddings)

- Receives: DocumentEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: chunk_id

### Row Store

- Receives: Rows
- Storage: Cassandra
- Schema-driven table structure

### Row Vector Store

- Receives: Row embeddings
- Storage: Vector DB
- Indexed by: row index fields
## Metadata Field Analysis

### Actively Used Fields

| Field | Usage |
|-------|-------|
| `metadata.id` | Document/chunk identifier, logging, provenance |
| `metadata.user` | Multi-tenancy, storage routing |
| `metadata.collection` | Target collection selection |
| `document_id` | Librarian reference, provenance linking |
| `chunk_id` | Provenance tracking through pipeline |
### Potentially Redundant Fields

| Field | Status |
|-------|--------|
| `metadata.metadata` | Set to `[]` by all extractors; document-level metadata now handled by librarian at submission time |
### Bytes Fields Pattern

All content fields (`data`, `text`, `chunk`) are `bytes` but immediately decoded to UTF-8 strings by all processors. No processor uses raw bytes.

## Flow Configuration

Flows are defined externally and provided to librarian via the config service. Each flow specifies:

- Input queues (`text-load`, `document-load`)
- Processor chain
- Parameters (chunk size, extraction method, etc.)

Example flow patterns:

- `pdf-graphrag`: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
- `text-graphrag`: Text → Chunker → Definitions + Relationships → Embeddings
- `pdf-ontology`: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
- `text-rows`: Text → Chunker → Row Extraction → Row Store
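To make the pieces concrete, here is what a flow definition might look like, expressed as a Python dict. The structure and key names are assumptions for illustration; actual flow definitions live in the config service and may differ:

```python
# Hypothetical shape of a flow definition; key names are illustrative.
pdf_graphrag_flow = {
    "id": "pdf-graphrag",
    "input_queue": "document-load",     # PDF entry point (see Routing)
    "processors": [
        "pdf-decoder",
        "chunker",
        "kg-extract-definitions",
        "kg-extract-relationships",
        "graph-embeddings",
    ],
    "parameters": {
        "chunk_size": 2000,             # defaults from the Chunker section
        "chunk_overlap": 100,
    },
}
```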