trustgraph/docs/tech-specs/extraction-flows.md
cybermaggedon aa4f5c6c00
Remove redundant metadata (#685)
The metadata field (list of triples) in the pipeline Metadata class
was redundant. Document metadata triples already flow directly from
librarian to triple-store via emit_document_provenance() - they don't
need to pass through the extraction pipeline.

Additionally, chunker and PDF decoder were overwriting metadata to []
anyway, so any metadata passed through the pipeline was being
discarded.

Changes:
- Remove metadata field from Metadata dataclass
  (schema/core/metadata.py)
- Update all Metadata instantiations to remove metadata=[]
  parameter
- Remove metadata handling from translators (document_loading,
  knowledge)
- Remove metadata consumption from extractors (ontology, agent)
- Update gateway serializers and import handlers
- Update all unit, integration, and contract tests
2026-03-11 10:51:39 +00:00


# Extraction Flows
This document describes how data flows through the TrustGraph extraction pipeline, from document submission through to storage in knowledge stores.
## Overview
```
┌───────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────┐
│ Librarian │────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge  │
│           │     │ (PDF only)  │     │         │     │ Extraction │
│           │────────────────────────▶│         │     │            │
└───────────┘     └─────────────┘     └─────────┘     └────────────┘
                                           │                │
                                           │                ├──▶ Triples
                                           │                ├──▶ Entity Contexts
                                           │                └──▶ Rows
                                           └──▶ Document Embeddings
```
## Content Storage
### Blob Storage (S3/Minio)
Document content is stored in S3-compatible blob storage:
- Path format: `doc/{object_id}` where object_id is a UUID
- All document types stored here: source documents, pages, chunks
### Metadata Storage (Cassandra)
Document metadata stored in Cassandra includes:
- Document ID, title, kind (MIME type)
- `object_id` reference to blob storage
- `parent_id` for child documents (pages, chunks)
- `document_type`: "source", "page", "chunk", "answer"
### Inline vs Streaming Threshold
Content transmission uses a size-based strategy:
- **< 2MB**: Content included inline in message (base64-encoded)
- **≥ 2MB**: Only `document_id` sent; processor fetches via librarian API
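The size-based dispatch above can be sketched like this (a minimal illustration; `build_message` and the dict shape are hypothetical, not the actual wire format):

```python
import base64

INLINE_THRESHOLD = 2 * 1024 * 1024  # 2 MB

def build_message(content: bytes, document_id: str) -> dict:
    """Build a pipeline message: small content travels inline
    (base64-encoded), large content is sent by reference only."""
    if len(content) < INLINE_THRESHOLD:
        # Small document: ship the bytes inline
        return {
            "document_id": document_id,
            "data": base64.b64encode(content).decode("ascii"),
        }
    # Large document: downstream processor fetches via the librarian API
    return {"document_id": document_id, "data": None}
```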
## Stage 1: Document Submission (Librarian)
### Entry Point
Documents enter the system via librarian's `add-document` operation:
1. Content uploaded to blob storage
2. Metadata record created in Cassandra
3. Returns document ID
### Triggering Extraction
The `add-processing` operation triggers extraction:
- Specifies `document_id`, `flow` (pipeline ID), `collection` (target store)
- Librarian's `load_document()` fetches content and publishes to flow input queue
### Schema: Document
```
Document
├── metadata: Metadata
│   ├── id: str         # Document identifier
│   ├── user: str       # Tenant/user ID
│   └── collection: str # Target collection
├── data: bytes         # PDF content (base64, if inline)
└── document_id: str    # Librarian reference (if streaming)
```
**Routing**: Based on the `kind` field:
- `application/pdf` → `document-load` queue → PDF Decoder
- `text/plain` → `text-load` queue → Chunker
## Stage 2: PDF Decoder
Converts PDF documents into text pages.
### Process
1. Fetch content (inline `data` or via `document_id` from librarian)
2. Extract pages using PyPDF
3. For each page:
- Save as child document in librarian (`{doc_id}/p{page_num}`)
- Emit provenance triples (page derived from document)
- Forward to chunker
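The per-page bookkeeping can be sketched as follows (text extraction itself uses PyPDF; `page_documents` is a hypothetical helper, the dict shape is illustrative, and 1-based page numbering is assumed):

```python
def page_documents(doc_id: str, pages: list[str]) -> list[dict]:
    """Build the child-document records saved to the librarian for each
    extracted page. Page IDs follow the {doc_id}/p{page_num} scheme."""
    records = []
    for num, text in enumerate(pages, start=1):
        records.append({
            "document_id": f"{doc_id}/p{num}",  # e.g. "doc123/p1"
            "parent_id": doc_id,                # provenance link
            "document_type": "page",
            "text": text.encode("utf-8"),
        })
    return records
```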
### Schema: TextDocument
```
TextDocument
├── metadata: Metadata
│   ├── id: str         # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   └── collection: str
├── text: bytes         # Page text content (if inline)
└── document_id: str    # Librarian reference (e.g., "doc123/p1")
```
## Stage 3: Chunker
Splits text into chunks of a configured size.
### Parameters (flow-configurable)
- `chunk_size`: Target chunk size in characters (default: 2000)
- `chunk_overlap`: Overlap between chunks (default: 100)
### Process
1. Fetch text content (inline or via librarian)
2. Split using recursive character splitter
3. For each chunk:
- Save as child document in librarian (`{parent_id}/c{index}`)
- Emit provenance triples (chunk derived from page/document)
- Forward to extraction processors
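A minimal sketch of the size/overlap behaviour (the actual chunker splits recursively on separators such as paragraph breaks before falling back to characters; this fixed-window version only illustrates the two parameters):

```python
def split_text(text: str, chunk_size: int = 2000,
               chunk_overlap: int = 100) -> list[str]:
    """Fixed-window splitter: each chunk is at most chunk_size characters
    and repeats the last chunk_overlap characters of its predecessor."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```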
### Schema: Chunk
```
Chunk
├── metadata: Metadata
│   ├── id: str         # Chunk URI
│   ├── user: str
│   └── collection: str
├── chunk: bytes        # Chunk text content
└── document_id: str    # Librarian chunk ID (e.g., "doc123/p1/c3")
```
### Document ID Hierarchy
Child documents encode their lineage in the ID:
- Source: `doc123`
- Page: `doc123/p5`
- Chunk from page: `doc123/p5/c2`
- Chunk from text: `doc123/c2`
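Because lineage is encoded in the ID itself, the ancestor chain can be recovered by splitting on `/` (a small illustrative helper, not part of the codebase):

```python
def lineage(doc_ref: str) -> list[str]:
    """Expand a hierarchical document ID into its ancestor chain,
    from the source document down to the given reference."""
    parts = doc_ref.split("/")
    return ["/".join(parts[:i + 1]) for i in range(len(parts))]
```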
## Stage 4: Knowledge Extraction
Multiple extraction patterns are available, selected by flow configuration.
### Pattern A: Basic GraphRAG
Two parallel processors:
**kg-extract-definitions**
- Input: Chunk
- Output: Triples (entity definitions), EntityContexts
- Extracts: entity labels, definitions
**kg-extract-relationships**
- Input: Chunk
- Output: Triples (relationships), EntityContexts
- Extracts: subject-predicate-object relationships
### Pattern B: Ontology-Driven (kg-extract-ontology)
- Input: Chunk
- Output: Triples, EntityContexts
- Uses configured ontology to guide extraction
### Pattern C: Agent-Based (kg-extract-agent)
- Input: Chunk
- Output: Triples, EntityContexts
- Uses agent framework for extraction
### Pattern D: Row Extraction (kg-extract-rows)
- Input: Chunk
- Output: Rows (structured data, not triples)
- Uses schema definition to extract structured records
### Schema: Triples
```
Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   └── collection: str
└── triples: list[Triple]
    └── Triple
        ├── s: Term       # Subject
        ├── p: Term       # Predicate
        ├── o: Term       # Object
        └── g: str | None # Named graph
```
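In Python, the tree above corresponds roughly to dataclasses like these (details beyond the tree, such as how `Term` distinguishes IRIs from literals, are assumptions, not the actual schema code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Term:
    value: str           # IRI or literal value
    is_uri: bool = True  # whether value is an IRI (assumed representation)

@dataclass
class Triple:
    s: Term                  # subject
    p: Term                  # predicate
    o: Term                  # object
    g: Optional[str] = None  # named graph ("" = core knowledge)

@dataclass
class Metadata:
    id: str          # document/chunk identifier
    user: str        # tenant/user ID
    collection: str  # target collection

@dataclass
class Triples:
    metadata: Metadata
    triples: list[Triple] = field(default_factory=list)
```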
### Schema: EntityContexts
```
EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term  # Entity identifier (IRI)
        ├── context: str  # Textual description for embedding
        └── chunk_id: str # Source chunk ID (provenance)
```
### Schema: Rows
```
Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]] # Extracted records
```
## Stage 5: Embeddings Generation
### Graph Embeddings
Converts entity contexts into vector embeddings.
**Process:**
1. Receive EntityContexts
2. Call embeddings service with context text
3. Output GraphEmbeddings (entity vector mapping)
**Schema: GraphEmbeddings**
```
GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term        # Entity identifier
        ├── vector: list[float] # Embedding vector
        └── chunk_id: str       # Source chunk (provenance)
```
### Document Embeddings
Converts chunk text directly into vector embeddings.
**Process:**
1. Receive Chunk
2. Call embeddings service with chunk text
3. Output DocumentEmbeddings
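The three steps above amount to a map over chunks (a sketch; `embed_fn` stands in for the embeddings service call, and the dict shape mirrors `ChunkEmbeddings`):

```python
def embed_chunks(chunks: list[dict], embed_fn) -> list[dict]:
    """Produce ChunkEmbeddings-style records: one vector per chunk,
    keyed by the chunk's librarian ID for provenance."""
    return [
        {"chunk_id": c["document_id"],
         "vector": embed_fn(c["chunk"].decode("utf-8"))}
        for c in chunks
    ]
```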
**Schema: DocumentEmbeddings**
```
DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str       # Chunk identifier
        └── vector: list[float] # Embedding vector
```
### Row Embeddings
Converts row index fields into vector embeddings.
**Process:**
1. Receive Rows
2. Embed configured index fields
3. Output to row vector store
## Stage 6: Storage
### Triple Store
- Receives: Triples
- Storage: Cassandra (entity-centric tables)
- Named graphs separate core knowledge from provenance:
  - `""` (default): Core knowledge facts
  - `urn:graph:source`: Extraction provenance
  - `urn:graph:retrieval`: Query-time explainability
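The separation can be illustrated with a hypothetical helper that stamps each triple's `g` field according to its role (the dict representation is a stand-in for the `Triple` schema):

```python
# Named-graph identifiers from the list above
CORE = ""                  # default graph: core knowledge facts
SOURCE = "urn:graph:source"  # extraction provenance

def tag_graphs(core_triples: list[dict],
               provenance_triples: list[dict]) -> list[dict]:
    """Stamp each triple dict with the named graph it belongs to."""
    return ([dict(t, g=CORE) for t in core_triples] +
            [dict(t, g=SOURCE) for t in provenance_triples])
```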
### Vector Store (Graph Embeddings)
- Receives: GraphEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: entity IRI
- Metadata: chunk_id for provenance
### Vector Store (Document Embeddings)
- Receives: DocumentEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: chunk_id
### Row Store
- Receives: Rows
- Storage: Cassandra
- Schema-driven table structure
### Row Vector Store
- Receives: Row embeddings
- Storage: Vector DB
- Indexed by: row index fields
## Metadata Field Analysis
### Actively Used Fields
| Field | Usage |
|-------|-------|
| `metadata.id` | Document/chunk identifier, logging, provenance |
| `metadata.user` | Multi-tenancy, storage routing |
| `metadata.collection` | Target collection selection |
| `document_id` | Librarian reference, provenance linking |
| `chunk_id` | Provenance tracking through pipeline |
### Removed Fields
| Field | Status |
|-------|--------|
| `metadata.metadata` | Removed from the `Metadata` class. Document-level metadata triples are emitted directly by the librarian to the triple store at submission time, not carried through the extraction pipeline. |
### Bytes Fields Pattern
All content fields (`data`, `text`, `chunk`) are declared as `bytes`, but every processor immediately decodes them to UTF-8 strings; no processor consumes the raw bytes.
## Flow Configuration
Flows are defined externally and provided to librarian via config service. Each flow specifies:
- Input queues (`text-load`, `document-load`)
- Processor chain
- Parameters (chunk size, extraction method, etc.)
Example flow patterns:
- `pdf-graphrag`: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
- `text-graphrag`: Text → Chunker → Definitions + Relationships → Embeddings
- `pdf-ontology`: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
- `text-rows`: Text → Chunker → Row Extraction → Row Store
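As a sketch, the `text-graphrag` pattern might be expressed to the config service roughly as follows (the key names and structure here are hypothetical, not the actual config format; the chunker defaults are from Stage 3):

```python
TEXT_GRAPHRAG = {
    "id": "text-graphrag",
    "input": "text-load",  # plain text skips the PDF decoder
    "processors": [
        {"name": "chunker",
         "params": {"chunk_size": 2000, "chunk_overlap": 100}},
        {"name": "kg-extract-definitions"},
        {"name": "kg-extract-relationships"},
        {"name": "graph-embeddings"},
    ],
}
```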