Mirror of https://github.com/trustgraph-ai/trustgraph.git (synced 2026-04-25 00:16:23 +02:00)
Dataflow tech spec for extraction (#684)
parent ca525b679f
commit 1837d73f34
1 changed files with 339 additions and 0 deletions
339
docs/tech-specs/extraction-flows.md
Normal file
@ -0,0 +1,339 @@
# Extraction Flows

This document describes how data flows through the TrustGraph extraction pipeline, from document submission to storage in the knowledge stores.

## Overview

```
┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
│          │     │ (PDF only)  │     │         │     │ Extraction         │
│          │────────────────────────▶│         │     │                    │
└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
                                          │                    │
                                          │                    ├──▶ Triples
                                          │                    ├──▶ Entity Contexts
                                          │                    └──▶ Rows
                                          │
                                          └──▶ Document Embeddings
```

## Content Storage

### Blob Storage (S3/MinIO)

Document content is stored in S3-compatible blob storage:
- Path format: `doc/{object_id}`, where `object_id` is a UUID
- All document types are stored here: source documents, pages, and chunks

### Metadata Storage (Cassandra)

Document metadata stored in Cassandra includes:
- Document ID, title, kind (MIME type)
- `object_id` reference to blob storage
- `parent_id` for child documents (pages, chunks)
- `document_type`: "source", "page", "chunk", "answer"

### Inline vs Streaming Threshold

Content transmission uses a size-based strategy:
- **< 2MB**: Content is included inline in the message (base64-encoded)
- **≥ 2MB**: Only `document_id` is sent; the processor fetches the content via the librarian API

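The size-based strategy above can be sketched as follows. This is a minimal illustration rather than TrustGraph's implementation: the `build_payload` helper, the dictionary layout, and the reading of "2MB" as 2 MiB measured on the raw (pre-base64) content are all assumptions made for the example.

```python
import base64

# Assumption: "2MB" is read as 2 MiB, measured before base64 encoding.
INLINE_THRESHOLD = 2 * 1024 * 1024


def build_payload(document_id: str, content: bytes) -> dict:
    """Build a hypothetical message body for a document.

    Small content travels inline (base64-encoded); large content is
    replaced by a reference that the receiver resolves via the librarian.
    """
    if len(content) < INLINE_THRESHOLD:
        return {
            "document_id": document_id,
            "data": base64.b64encode(content).decode("ascii"),
        }
    # At or above the threshold: send the reference only.
    return {"document_id": document_id, "data": None}
```

Keeping the decision in one helper means the threshold is compared in exactly one place, so senders cannot disagree about which side of the boundary a document falls on.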
## Stage 1: Document Submission (Librarian)

### Entry Point

Documents enter the system via the librarian's `add-document` operation:
1. Content is uploaded to blob storage
2. A metadata record is created in Cassandra
3. The document ID is returned

### Triggering Extraction

The `add-processing` operation triggers extraction:
- Specifies `document_id`, `flow` (pipeline ID), and `collection` (target store)
- The librarian's `load_document()` fetches the content and publishes it to the flow's input queue

### Schema: Document

```
Document
├── metadata: Metadata
│   ├── id: str                 # Document identifier
│   ├── user: str               # Tenant/user ID
│   ├── collection: str         # Target collection
│   └── metadata: list[Triple]  # (largely unused, historical)
├── data: bytes                 # PDF content (base64, if inline)
└── document_id: str            # Librarian reference (if streaming)
```

**Routing**: Based on the `kind` field:
- `application/pdf` → `document-load` queue → PDF Decoder
- `text/plain` → `text-load` queue → Chunker

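The routing rule above amounts to a lookup from MIME type to queue name. A minimal sketch, assuming a hypothetical `select_queue` helper; how the librarian actually handles an unsupported `kind` is not stated here, so raising an error is an assumption:

```python
# Hypothetical routing table mirroring the two rules listed above.
QUEUE_BY_KIND = {
    "application/pdf": "document-load",  # consumed by the PDF Decoder
    "text/plain": "text-load",           # consumed by the Chunker
}


def select_queue(kind: str) -> str:
    """Return the input queue name for a document's MIME type."""
    try:
        return QUEUE_BY_KIND[kind]
    except KeyError:
        # Assumption: unknown kinds are rejected rather than defaulted.
        raise ValueError(f"unsupported document kind: {kind!r}") from None
```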
## Stage 2: PDF Decoder

Converts PDF documents into text pages.

### Process

1. Fetch content (inline `data` or via `document_id` from the librarian)
2. Extract pages using PyPDF
3. For each page:
   - Save as a child document in the librarian (`{doc_id}/p{page_num}`)
   - Emit provenance triples (page derived from document)
   - Forward to the chunker

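Step 1 of the process (inline content or a librarian fetch) can be sketched on the receiving side as below. The `resolve_content` name, the `fetch` callback standing in for the librarian API, and the assumption that inline data arrives base64-encoded are all illustrative, not confirmed details of the real processors.

```python
import base64
from typing import Callable, Optional


def resolve_content(
    data: Optional[str],
    document_id: Optional[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Return raw document bytes from an incoming message.

    `fetch` is a hypothetical stand-in for a librarian API call that
    returns the stored content for a document ID.
    """
    if data:
        # Inline path: content was base64-encoded into the message.
        return base64.b64decode(data)
    if document_id:
        # Streaming path: only a reference was sent.
        return fetch(document_id)
    raise ValueError("message carries neither inline data nor a document_id")
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular client and makes the inline and streaming paths easy to exercise separately.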
### Schema: TextDocument

```
TextDocument
├── metadata: Metadata
│   ├── id: str                 # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── text: bytes                 # Page text content (if inline)
└── document_id: str            # Librarian reference (e.g., "doc123/p1")
```

## Stage 3: Chunker

Splits text into chunks of a configured size.

### Parameters (flow-configurable)

- `chunk_size`: Target chunk size in characters (default: 2000)
- `chunk_overlap`: Overlap between consecutive chunks (default: 100)

### Process

1. Fetch text content (inline or via the librarian)
2. Split using a recursive character splitter
3. For each chunk:
   - Save as a child document in the librarian (`{parent_id}/c{index}`)
   - Emit provenance triples (chunk derived from page/document)
   - Forward to the extraction processors

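To show how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window splitter. It is not the recursive character splitter the chunker uses (which would additionally try to break at natural separators); the function name is hypothetical, and measuring the overlap in characters is an assumption.

```python
def split_with_overlap(
    text: str, chunk_size: int = 2000, chunk_overlap: int = 100
) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplification: cuts at exact character offsets, with no attempt to
    respect paragraph, sentence, or word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 1900 characters, so the last 100 characters of one chunk reappear at the start of the next.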
### Schema: Chunk

```
Chunk
├── metadata: Metadata
│   ├── id: str                 # Chunk URI
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── chunk: bytes                # Chunk text content
└── document_id: str            # Librarian chunk ID (e.g., "doc123/p1/c3")
```

### Document ID Hierarchy

Child documents encode their lineage in the ID:
- Source: `doc123`
- Page: `doc123/p5`
- Chunk from page: `doc123/p5/c2`
- Chunk from text: `doc123/c2`

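The ID scheme above can be built and walked with a few string helpers. The function names here are hypothetical, and the parser assumes that source document IDs never themselves end in a `/p<n>` or `/c<n>` segment.

```python
import re
from typing import Optional


def make_page_id(doc_id: str, page_num: int) -> str:
    """ID of a page derived from a source document."""
    return f"{doc_id}/p{page_num}"


def make_chunk_id(parent_id: str, index: int) -> str:
    """ID of a chunk derived from a page or directly from a text document."""
    return f"{parent_id}/c{index}"


# Matches a trailing page ("/p5") or chunk ("/c2") segment.
_CHILD_SUFFIX = re.compile(r"/[pc]\d+$")


def parent_of(child_id: str) -> Optional[str]:
    """Return the parent ID, or None if the ID has no page/chunk suffix."""
    stripped, count = _CHILD_SUFFIX.subn("", child_id)
    return stripped if count else None
```

Because lineage is encoded in the ID itself, a chunk's page and source document can be recovered by repeatedly calling `parent_of`, without a metadata lookup.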
## Stage 4: Knowledge Extraction

Multiple extraction patterns are available, selected by flow configuration.

### Pattern A: Basic GraphRAG

Two parallel processors:

**kg-extract-definitions**
- Input: Chunk
- Output: Triples (entity definitions), EntityContexts
- Extracts: entity labels, definitions

**kg-extract-relationships**
- Input: Chunk
- Output: Triples (relationships), EntityContexts
- Extracts: subject-predicate-object relationships

### Pattern B: Ontology-Driven (kg-extract-ontology)

- Input: Chunk
- Output: Triples, EntityContexts
- Uses the configured ontology to guide extraction

### Pattern C: Agent-Based (kg-extract-agent)

- Input: Chunk
- Output: Triples, EntityContexts
- Uses the agent framework for extraction

### Pattern D: Row Extraction (kg-extract-rows)

- Input: Chunk
- Output: Rows (structured data, not triples)
- Uses a schema definition to extract structured records

### Schema: Triples

```
Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]  # (set to [] by extractors)
└── triples: list[Triple]
    └── Triple
        ├── s: Term             # Subject
        ├── p: Term             # Predicate
        ├── o: Term             # Object
        └── g: str | None       # Named graph
```

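For readers who prefer code to tree diagrams, the Triples message above can be mirrored with dataclasses. This is a sketch only: `Term` is reduced to a plain string because its real structure is not described here, and the example IRIs are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Simplification: the real Term type is not specified in this document.
Term = str


@dataclass
class Triple:
    s: Term                   # Subject
    p: Term                   # Predicate
    o: Term                   # Object
    g: Optional[str] = None   # Named graph


@dataclass
class Metadata:
    id: str
    user: str
    collection: str
    metadata: list[Triple] = field(default_factory=list)  # [] from extractors


@dataclass
class Triples:
    metadata: Metadata
    triples: list[Triple] = field(default_factory=list)


# Hypothetical message, as an extractor might emit for one chunk.
message = Triples(
    metadata=Metadata(id="doc123/p1/c3", user="alice", collection="default"),
    triples=[Triple(s="http://example.org/cat",
                    p="http://www.w3.org/2000/01/rdf-schema#label",
                    o="cat")],
)
```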
### Schema: EntityContexts

```
EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term        # Entity identifier (IRI)
        ├── context: str        # Textual description for embedding
        └── chunk_id: str       # Source chunk ID (provenance)
```

### Schema: Rows

```
Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]]  # Extracted records
```

## Stage 5: Embeddings Generation

### Graph Embeddings

Converts entity contexts into vector embeddings.

**Process:**
1. Receive EntityContexts
2. Call the embeddings service with the context text
3. Output GraphEmbeddings (entity → vector mapping)

**Schema: GraphEmbeddings**

```
GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term          # Entity identifier
        ├── vector: list[float]   # Embedding vector
        └── chunk_id: str         # Source chunk (provenance)
```

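The three-step process above is essentially a map over entity contexts that preserves `chunk_id` for provenance. A minimal sketch, with the embeddings service replaced by a caller-supplied function; the class shapes follow the schemas in this document, but `Term` is simplified to `str` and `embed_entity_contexts` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EntityContext:
    entity: str     # Term simplified to a string for this sketch
    context: str
    chunk_id: str


@dataclass
class EntityEmbeddings:
    entity: str
    vector: list[float]
    chunk_id: str


def embed_entity_contexts(
    contexts: list[EntityContext],
    embed: Callable[[str], list[float]],
) -> list[EntityEmbeddings]:
    """Embed each context text, carrying entity and chunk_id through unchanged.

    `embed` stands in for a call to the embeddings service.
    """
    return [
        EntityEmbeddings(entity=c.entity, vector=embed(c.context), chunk_id=c.chunk_id)
        for c in contexts
    ]
```

Only the `context` text is sent for embedding; the entity identifier and source chunk are copied across so that each stored vector can still be traced back to where it came from.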
### Document Embeddings

Converts chunk text directly into vector embeddings.

**Process:**
1. Receive Chunk
2. Call the embeddings service with the chunk text
3. Output DocumentEmbeddings

**Schema: DocumentEmbeddings**

```
DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str         # Chunk identifier
        └── vector: list[float]   # Embedding vector
```

### Row Embeddings

Converts row index fields into vector embeddings.

**Process:**
1. Receive Rows
2. Embed the configured index fields
3. Output to the row vector store

## Stage 6: Storage

### Triple Store

- Receives: Triples
- Storage: Cassandra (entity-centric tables)
- Named graphs separate core knowledge from provenance:
  - `""` (default): Core knowledge facts
  - `urn:graph:source`: Extraction provenance
  - `urn:graph:retrieval`: Query-time explainability

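The named-graph separation listed above can be illustrated by grouping triples on their graph value. This is a sketch over plain tuples rather than the real store; treating a missing graph (`None`) as the default graph `""` is an assumption.

```python
from collections import defaultdict
from typing import Iterable, Optional

DEFAULT_GRAPH = ""                        # Core knowledge facts
SOURCE_GRAPH = "urn:graph:source"         # Extraction provenance
RETRIEVAL_GRAPH = "urn:graph:retrieval"   # Query-time explainability

Quad = tuple[str, str, str, Optional[str]]  # (s, p, o, g)


def group_by_graph(quads: Iterable[Quad]) -> dict[str, list[tuple[str, str, str]]]:
    """Partition (s, p, o, g) tuples by named graph.

    Assumption: g=None is treated the same as the default graph "".
    """
    groups: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
    for s, p, o, g in quads:
        groups[g or DEFAULT_GRAPH].append((s, p, o))
    return dict(groups)
```

Keeping provenance in its own graph means that queries over core knowledge do not have to filter out derivation triples.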
### Vector Store (Graph Embeddings)

- Receives: GraphEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: entity IRI
- Metadata: `chunk_id` for provenance

### Vector Store (Document Embeddings)

- Receives: DocumentEmbeddings
- Storage: Qdrant, Milvus, or Pinecone
- Indexed by: `chunk_id`

### Row Store

- Receives: Rows
- Storage: Cassandra
- Schema-driven table structure

### Row Vector Store

- Receives: Row embeddings
- Storage: Vector DB
- Indexed by: row index fields

## Metadata Field Analysis

### Actively Used Fields

| Field | Usage |
|-------|-------|
| `metadata.id` | Document/chunk identifier, logging, provenance |
| `metadata.user` | Multi-tenancy, storage routing |
| `metadata.collection` | Target collection selection |
| `document_id` | Librarian reference, provenance linking |
| `chunk_id` | Provenance tracking through pipeline |

### Potentially Redundant Fields

| Field | Status |
|-------|--------|
| `metadata.metadata` | Set to `[]` by all extractors; document-level metadata is now handled by the librarian at submission time |

### Bytes Fields Pattern

All content fields (`data`, `text`, `chunk`) are typed as `bytes`, but every processor immediately decodes them as UTF-8 into strings. No processor uses the raw bytes.

## Flow Configuration

Flows are defined externally and provided to the librarian via the config service. Each flow specifies:

- Input queues (`text-load`, `document-load`)
- Processor chain
- Parameters (chunk size, extraction method, etc.)

Example flow patterns:
- `pdf-graphrag`: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
- `text-graphrag`: Text → Chunker → Definitions + Relationships → Embeddings
- `pdf-ontology`: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
- `text-rows`: Text → Chunker → Row Extraction → Row Store

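The arrow notation used for the example flow patterns can be read mechanically. The sketch below parses a pattern into stages and picks the input queue from its first stage; it is purely illustrative and does not reflect the real flow definition format, and both helper names and the lookup table are hypothetical.

```python
def parse_flow_pattern(pattern: str) -> list[str]:
    """Split 'A → B → C' notation into a list of stage names."""
    return [stage.strip() for stage in pattern.split("→")]


# Hypothetical lookup: the first stage of a pattern determines which input
# queue the flow reads from (mirroring the PDF/text routing rule).
INPUT_QUEUE_BY_SOURCE = {"PDF": "document-load", "Text": "text-load"}


def input_queue_for(pattern: str) -> str:
    """Return the input queue implied by a flow pattern's first stage."""
    stages = parse_flow_pattern(pattern)
    return INPUT_QUEUE_BY_SOURCE[stages[0]]
```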