mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 00:16:23 +02:00

Dataflow tech spec for extraction (#684)

2026-03-11 10:49:50 +00:00

Extraction Flows

This document describes how data flows through the TrustGraph extraction pipeline, from document submission through to storage in knowledge stores.

Overview

┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
│          │     │ (PDF only)  │     │         │     │ Extraction         │
│          │────────────────────────▶│         │     │                    │
└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
                                          │                    │
                                          │                    ├──▶ Triples
                                          │                    ├──▶ Entity Contexts
                                          │                    └──▶ Rows
                                          │
                                          └──▶ Document Embeddings

Content Storage

Blob Storage (S3/Minio)

Document content is stored in S3-compatible blob storage:

Path format: doc/{object_id}, where object_id is a UUID
All document types are stored here: source documents, pages, and chunks

Metadata Storage (Cassandra)

Document metadata stored in Cassandra includes:

Document ID, title, kind (MIME type)
object_id reference to blob storage
parent_id for child documents (pages, chunks)
document_type: "source", "page", "chunk", "answer"

Inline vs Streaming Threshold

Content transmission uses a size-based strategy:

< 2MB: Content included inline in message (base64-encoded)
≥ 2MB: Only document_id sent; processor fetches via librarian API
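The size-based strategy above can be sketched as follows. This is a minimal illustration, not the librarian's actual code: `build_payload`, `INLINE_LIMIT`, and the dictionary layout are hypothetical, and reading "2MB" as 2 × 1024 × 1024 bytes is an assumption.

```python
import base64

# Assumption: "2MB" is read here as 2 * 1024 * 1024 bytes.
INLINE_LIMIT = 2 * 1024 * 1024


def build_payload(document_id: str, content: bytes) -> dict:
    """Hypothetical helper: embed small content inline, reference large content by ID."""
    if len(content) < INLINE_LIMIT:
        # Below the threshold: ship the content itself, base64-encoded.
        return {
            "data": base64.b64encode(content).decode("ascii"),
            "document_id": "",
        }
    # At or above the threshold: send only the reference; the receiving
    # processor is expected to fetch the content via the librarian API.
    return {"data": "", "document_id": document_id}
```

A receiving processor would check whether `data` is populated and fall back to a librarian fetch by `document_id` when it is not.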

Stage 1: Document Submission (Librarian)

Entry Point

Documents enter the system via librarian's add-document operation:

Content uploaded to blob storage
Metadata record created in Cassandra
Returns document ID

Triggering Extraction

The add-processing operation triggers extraction:

Specifies document_id, flow (pipeline ID), collection (target store)
Librarian's load_document() fetches content and publishes to flow input queue

Schema: Document

Document
├── metadata: Metadata
│   ├── id: str              # Document identifier
│   ├── user: str            # Tenant/user ID
│   ├── collection: str      # Target collection
│   └── metadata: list[Triple]  # (largely unused, historical)
├── data: bytes              # PDF content (base64, if inline)
└── document_id: str         # Librarian reference (if streaming)

Routing is based on the document's kind field (MIME type):

application/pdf → document-load queue → PDF Decoder
text/plain → text-load queue → Chunker
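As an illustration of the routing rule, the mapping from kind to input queue could look like the sketch below. The `ROUTES` table and `route_for` function are hypothetical names, and raising an error for an unlisted kind is an assumption; the document does not say how other MIME types are handled.

```python
# Kind (MIME type) -> flow input queue, as listed above.
ROUTES = {
    "application/pdf": "document-load",  # consumed by the PDF Decoder
    "text/plain": "text-load",           # consumed by the Chunker
}


def route_for(kind: str) -> str:
    """Return the input queue for a document kind (hypothetical helper)."""
    try:
        return ROUTES[kind]
    except KeyError:
        # Assumption: unknown kinds are rejected rather than silently dropped.
        raise ValueError(f"no extraction route for kind {kind!r}") from None
```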

Stage 2: PDF Decoder

Converts PDF documents into text pages.

Process

Fetch content (inline data or via document_id from librarian)
Extract pages using PyPDF
For each page:
- Save as child document in librarian ({doc_id}/p{page_num})
- Emit provenance triples (page derived from document)
- Forward to chunker

Schema: TextDocument

TextDocument
├── metadata: Metadata
│   ├── id: str              # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── text: bytes              # Page text content (if inline)
└── document_id: str         # Librarian reference (e.g., "doc123/p1")

Stage 3: Chunker

Splits text into chunks of a configured size.

Parameters (flow-configurable)

chunk_size: Target chunk size in characters (default: 2000)
chunk_overlap: Overlap between chunks (default: 100)

Process

Fetch text content (inline or via librarian)
Split using recursive character splitter
For each chunk:
- Save as child document in librarian ({parent_id}/c{index})
- Emit provenance triples (chunk derived from page/document)
- Forward to extraction processors
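To make the effect of chunk_size and chunk_overlap concrete, here is a deliberately simplified fixed-window splitter. It is not the recursive character splitter the chunker uses (which prefers to break on separators such as paragraphs and sentences); it only shows how the two parameters interact. The function names are hypothetical, and numbering chunks from 1 is an assumption.

```python
def split_with_overlap(text: str, chunk_size: int = 2000,
                       chunk_overlap: int = 100) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text, so the tail is not
        # emitted again as a chunk that consists only of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_ids(parent_id: str, count: int) -> list[str]:
    """Build child IDs in the {parent_id}/c{index} form (index base is an assumption)."""
    return [f"{parent_id}/c{i}" for i in range(1, count + 1)]
```

With the defaults, a 4,500-character page yields three chunks: two of 2,000 characters and a final one of 700, each sharing 100 characters with its predecessor.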

Schema: Chunk

Chunk
├── metadata: Metadata
│   ├── id: str              # Chunk URI
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── chunk: bytes             # Chunk text content
└── document_id: str         # Librarian chunk ID (e.g., "doc123/p1/c3")

Document ID Hierarchy

Child documents encode their lineage in the ID:

Source: doc123
Page: doc123/p5
Chunk from page: doc123/p5/c2
Chunk from text: doc123/c2
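Because lineage is encoded in the ID, a parent can be recovered without a metadata lookup. The sketch below parses the four forms listed above; `Lineage`, `parse_lineage`, and `parent_of` are hypothetical names, and the pattern assumes the page and chunk suffixes always take the /p{n} and /c{n} forms shown.

```python
import re
from typing import NamedTuple, Optional


class Lineage(NamedTuple):
    source: str            # ID of the source document
    page: Optional[int]    # page number, if the ID passes through a page
    chunk: Optional[int]   # chunk index, if the ID names a chunk


# Source ID, then an optional /p<n> segment, then an optional /c<n> segment.
_ID_PATTERN = re.compile(r"(?P<source>.+?)(?:/p(?P<page>\d+))?(?:/c(?P<chunk>\d+))?")


def parse_lineage(document_id: str) -> Lineage:
    match = _ID_PATTERN.fullmatch(document_id)
    if match is None:
        raise ValueError(f"unrecognised document ID: {document_id!r}")
    page, chunk = match.group("page"), match.group("chunk")
    return Lineage(
        source=match.group("source"),
        page=int(page) if page is not None else None,
        chunk=int(chunk) if chunk is not None else None,
    )


def parent_of(document_id: str) -> Optional[str]:
    """Return the parent ID, or None for a source document."""
    lineage = parse_lineage(document_id)
    if lineage.page is None and lineage.chunk is None:
        return None
    # Dropping the last path segment yields the page (for a chunk) or the source.
    return document_id.rsplit("/", 1)[0]
```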

Stage 4: Knowledge Extraction

Multiple extraction patterns are available; the flow configuration selects which one runs.

Pattern A: Basic GraphRAG

Two parallel processors:

kg-extract-definitions

Input: Chunk
Output: Triples (entity definitions), EntityContexts
Extracts: entity labels, definitions

kg-extract-relationships

Input: Chunk
Output: Triples (relationships), EntityContexts
Extracts: subject-predicate-object relationships

Pattern B: Ontology-Driven (kg-extract-ontology)

Input: Chunk
Output: Triples, EntityContexts
Uses configured ontology to guide extraction

Pattern C: Agent-Based (kg-extract-agent)

Input: Chunk
Output: Triples, EntityContexts
Uses agent framework for extraction

Pattern D: Row Extraction (kg-extract-rows)

Input: Chunk
Output: Rows (structured data, not triples)
Uses schema definition to extract structured records

Schema: Triples

Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]  # (set to [] by extractors)
└── triples: list[Triple]
    └── Triple
        ├── s: Term              # Subject
        ├── p: Term              # Predicate
        ├── o: Term              # Object
        └── g: str | None        # Named graph

Schema: EntityContexts

EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term         # Entity identifier (IRI)
        ├── context: str         # Textual description for embedding
        └── chunk_id: str        # Source chunk ID (provenance)

Schema: Rows

Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]]   # Extracted records

Stage 5: Embeddings Generation

Graph Embeddings

Converts entity contexts into vector embeddings.

Process:

Receive EntityContexts
Call embeddings service with context text
Output GraphEmbeddings (entity → vector mapping)

Schema: GraphEmbeddings

GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term         # Entity identifier
        ├── vector: list[float]  # Embedding vector
        └── chunk_id: str        # Source chunk (provenance)
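The graph embeddings step is essentially a map over entity contexts that preserves the entity and chunk_id while replacing the context text with a vector. A minimal sketch follows, under several assumptions: the dataclasses are simplified stand-ins for the schemas above (entity is a plain string rather than a Term), and `embed` is a placeholder for the call to the embeddings service, whose real interface is not described here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EntityContext:
    entity: str      # simplified: the schema uses Term
    context: str
    chunk_id: str


@dataclass
class EntityEmbeddings:
    entity: str
    vector: list[float]
    chunk_id: str


def embed_entity_contexts(
    contexts: Iterable[EntityContext],
    embed: Callable[[str], list[float]],
) -> list[EntityEmbeddings]:
    """Embed each context text, carrying entity and chunk_id through unchanged."""
    return [
        EntityEmbeddings(entity=c.entity, vector=embed(c.context), chunk_id=c.chunk_id)
        for c in contexts
    ]
```

Carrying chunk_id through unchanged is what lets the vector store keep provenance back to the source chunk.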

Document Embeddings

Converts chunk text directly into vector embeddings.

Process:

Receive Chunk
Call embeddings service with chunk text
Output DocumentEmbeddings

Schema: DocumentEmbeddings

DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str        # Chunk identifier
        └── vector: list[float]  # Embedding vector

Row Embeddings

Converts row index fields into vector embeddings.

Process:

Receive Rows
Embed configured index fields
Output to row vector store

Stage 6: Storage

Triple Store

Receives: Triples
Storage: Cassandra (entity-centric tables)
Named graphs separate core knowledge from provenance:
- "" (default): Core knowledge facts
- urn:graph:source: Extraction provenance
- urn:graph:retrieval: Query-time explainability
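The named-graph separation can be illustrated by grouping triples on their g field. This is a sketch only: `Triple` here is a simplified tuple with string terms, `group_by_graph` is a hypothetical helper, and treating g = None the same as the default graph "" is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

DEFAULT_GRAPH = ""                         # core knowledge facts
SOURCE_GRAPH = "urn:graph:source"          # extraction provenance
RETRIEVAL_GRAPH = "urn:graph:retrieval"    # query-time explainability


class Triple(NamedTuple):
    s: str
    p: str
    o: str
    g: Optional[str] = None


def group_by_graph(triples: list[Triple]) -> dict[str, list[Triple]]:
    """Partition triples by named graph (None is folded into the default graph)."""
    grouped: dict[str, list[Triple]] = defaultdict(list)
    for triple in triples:
        grouped[triple.g or DEFAULT_GRAPH].append(triple)
    return dict(grouped)
```

A query over core knowledge would then read only the default graph and leave the provenance triples out of the result.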

Vector Store (Graph Embeddings)

Receives: GraphEmbeddings
Storage: Qdrant, Milvus, or Pinecone
Indexed by: entity IRI
Metadata: chunk_id for provenance

Vector Store (Document Embeddings)

Receives: DocumentEmbeddings
Storage: Qdrant, Milvus, or Pinecone
Indexed by: chunk_id

Row Store

Receives: Rows
Storage: Cassandra
Schema-driven table structure

Row Vector Store

Receives: Row embeddings
Storage: Vector DB
Indexed by: row index fields

Metadata Field Analysis

Actively Used Fields

| Field | Usage |
|-------|-------|
| `metadata.id` | Document/chunk identifier, logging, provenance |
| `metadata.user` | Multi-tenancy, storage routing |
| `metadata.collection` | Target collection selection |
| `document_id` | Librarian reference, provenance linking |
| `chunk_id` | Provenance tracking through pipeline |

Potentially Redundant Fields

| Field | Status |
|-------|--------|
| `metadata.metadata` | Set to `[]` by all extractors; document-level metadata now handled by librarian at submission time |

Bytes Fields Pattern

All content fields (data, text, chunk) are bytes but immediately decoded to UTF-8 strings by all processors. No processor uses raw bytes.

Flow Configuration

Flows are defined externally and provided to the librarian via the config service. Each flow specifies:

Input queues (text-load, document-load)
Processor chain
Parameters (chunk size, extraction method, etc.)

Example flow patterns:

pdf-graphrag: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
text-graphrag: Text → Chunker → Definitions + Relationships → Embeddings
pdf-ontology: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
text-rows: Text → Chunker → Row Extraction → Row Store
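For illustration only, the example patterns above can be written down as data. This is not the format used by the config service, which is not described here; `FLOW_PATTERNS`, `INPUT_QUEUES`, and `input_queue` are hypothetical names chosen for this sketch.

```python
# Stage sequences copied from the example flow patterns above.
FLOW_PATTERNS = {
    "pdf-graphrag": ["PDF", "Decoder", "Chunker", "Definitions + Relationships", "Embeddings"],
    "text-graphrag": ["Text", "Chunker", "Definitions + Relationships", "Embeddings"],
    "pdf-ontology": ["PDF", "Decoder", "Chunker", "Ontology Extraction", "Embeddings"],
    "text-rows": ["Text", "Chunker", "Row Extraction", "Row Store"],
}

# The first stage names the input type, which determines the input queue.
INPUT_QUEUES = {"PDF": "document-load", "Text": "text-load"}


def input_queue(flow_name: str) -> str:
    """Return the queue a flow reads from, based on its first stage."""
    stages = FLOW_PATTERNS[flow_name]
    return INPUT_QUEUES[stages[0]]
```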
