Dataflow tech spec for extraction (#684)

2026-06-09 06:45:13 +02:00 · 2026-03-11 10:49:50 +00:00 · 2026-03-11 10:49:50 +00:00 · 1837d73f34
commit 1837d73f34
parent ca525b679f
1 changed files with 339 additions and 0 deletions
--- a/docs/tech-specs/extraction-flows.md
+++ b/docs/tech-specs/extraction-flows.md
@ -0,0 +1,339 @@
+# Extraction Flows
+
+This document describes how data flows through the TrustGraph extraction pipeline, from document submission through to storage in knowledge stores.
+
+## Overview
+
+```
+┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
+│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
+│          │     │ (PDF only)  │     │         │     │ Extraction         │
+│          │────────────────────────▶│         │     │                    │
+└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
+                                          │                    │
+                                          │                    ├──▶ Triples
+                                          │                    ├──▶ Entity Contexts
+                                          │                    └──▶ Rows
+                                          │
+                                          └──▶ Document Embeddings
+```
+
+## Content Storage
+
+### Blob Storage (S3/Minio)
+
+Document content is stored in S3-compatible blob storage:
+- Path format: `doc/{object_id}` where object_id is a UUID
+- All document types stored here: source documents, pages, chunks
+
+### Metadata Storage (Cassandra)
+
+Document metadata stored in Cassandra includes:
+- Document ID, title, kind (MIME type)
+- `object_id` reference to blob storage
+- `parent_id` for child documents (pages, chunks)
+- `document_type`: "source", "page", "chunk", "answer"
+
+### Inline vs Streaming Threshold
+
+Content transmission uses a size-based strategy:
+- **< 2MB**: Content included inline in message (base64-encoded)
+- **≥ 2MB**: Only `document_id` sent; processor fetches via librarian API
+
+## Stage 1: Document Submission (Librarian)
+
+### Entry Point
+
+Documents enter the system via librarian's `add-document` operation:
+1. Content uploaded to blob storage
+2. Metadata record created in Cassandra
+3. Returns document ID
+
+### Triggering Extraction
+
+The `add-processing` operation triggers extraction:
+- Specifies `document_id`, `flow` (pipeline ID), `collection` (target store)
+- Librarian's `load_document()` fetches content and publishes to flow input queue
+
+### Schema: Document
+
+```
+Document
+├── metadata: Metadata
+│   ├── id: str              # Document identifier
+│   ├── user: str            # Tenant/user ID
+│   ├── collection: str      # Target collection
+│   └── metadata: list[Triple]  # (largely unused, historical)
+├── data: bytes              # PDF content (base64, if inline)
+└── document_id: str         # Librarian reference (if streaming)
+```
+
+**Routing**: Based on `kind` field:
+- `application/pdf` → `document-load` queue → PDF Decoder
+- `text/plain` → `text-load` queue → Chunker
+
+## Stage 2: PDF Decoder
+
+Converts PDF documents into text pages.
+
+### Process
+
+1. Fetch content (inline `data` or via `document_id` from librarian)
+2. Extract pages using PyPDF
+3. For each page:
+   - Save as child document in librarian (`{doc_id}/p{page_num}`)
+   - Emit provenance triples (page derived from document)
+   - Forward to chunker
+
+### Schema: TextDocument
+
+```
+TextDocument
+├── metadata: Metadata
+│   ├── id: str              # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
+│   ├── user: str
+│   ├── collection: str
+│   └── metadata: list[Triple]
+├── text: bytes              # Page text content (if inline)
+└── document_id: str         # Librarian reference (e.g., "doc123/p1")
+```
+
+## Stage 3: Chunker
+
+Splits text into chunks at configured size.
+
+### Parameters (flow-configurable)
+
+- `chunk_size`: Target chunk size in characters (default: 2000)
+- `chunk_overlap`: Overlap between chunks (default: 100)
+
+### Process
+
+1. Fetch text content (inline or via librarian)
+2. Split using recursive character splitter
+3. For each chunk:
+   - Save as child document in librarian (`{parent_id}/c{index}`)
+   - Emit provenance triples (chunk derived from page/document)
+   - Forward to extraction processors
+
+### Schema: Chunk
+
+```
+Chunk
+├── metadata: Metadata
+│   ├── id: str              # Chunk URI
+│   ├── user: str
+│   ├── collection: str
+│   └── metadata: list[Triple]
+├── chunk: bytes             # Chunk text content
+└── document_id: str         # Librarian chunk ID (e.g., "doc123/p1/c3")
+```
+
+### Document ID Hierarchy
+
+Child documents encode their lineage in the ID:
+- Source: `doc123`
+- Page: `doc123/p5`
+- Chunk from page: `doc123/p5/c2`
+- Chunk from text: `doc123/c2`
+
+## Stage 4: Knowledge Extraction
+
+Multiple extraction patterns available, selected by flow configuration.
+
+### Pattern A: Basic GraphRAG
+
+Two parallel processors:
+
+**kg-extract-definitions**
+- Input: Chunk
+- Output: Triples (entity definitions), EntityContexts
+- Extracts: entity labels, definitions
+
+**kg-extract-relationships**
+- Input: Chunk
+- Output: Triples (relationships), EntityContexts
+- Extracts: subject-predicate-object relationships
+
+### Pattern B: Ontology-Driven (kg-extract-ontology)
+
+- Input: Chunk
+- Output: Triples, EntityContexts
+- Uses configured ontology to guide extraction
+
+### Pattern C: Agent-Based (kg-extract-agent)
+
+- Input: Chunk
+- Output: Triples, EntityContexts
+- Uses agent framework for extraction
+
+### Pattern D: Row Extraction (kg-extract-rows)
+
+- Input: Chunk
+- Output: Rows (structured data, not triples)
+- Uses schema definition to extract structured records
+
+### Schema: Triples
+
+```
+Triples
+├── metadata: Metadata
+│   ├── id: str
+│   ├── user: str
+│   ├── collection: str
+│   └── metadata: list[Triple]  # (set to [] by extractors)
+└── triples: list[Triple]
+    └── Triple
+        ├── s: Term              # Subject
+        ├── p: Term              # Predicate
+        ├── o: Term              # Object
+        └── g: str | None        # Named graph
+```
+
+### Schema: EntityContexts
+
+```
+EntityContexts
+├── metadata: Metadata
+└── entities: list[EntityContext]
+    └── EntityContext
+        ├── entity: Term         # Entity identifier (IRI)
+        ├── context: str         # Textual description for embedding
+        └── chunk_id: str        # Source chunk ID (provenance)
+```
+
+### Schema: Rows
+
+```
+Rows
+├── metadata: Metadata
+├── row_schema: RowSchema
+│   ├── name: str
+│   ├── description: str
+│   └── fields: list[Field]
+└── rows: list[dict[str, str]]   # Extracted records
+```
+
+## Stage 5: Embeddings Generation
+
+### Graph Embeddings
+
+Converts entity contexts into vector embeddings.
+
+**Process:**
+1. Receive EntityContexts
+2. Call embeddings service with context text
+3. Output GraphEmbeddings (entity → vector mapping)
+
+**Schema: GraphEmbeddings**
+
+```
+GraphEmbeddings
+├── metadata: Metadata
+└── entities: list[EntityEmbeddings]
+    └── EntityEmbeddings
+        ├── entity: Term         # Entity identifier
+        ├── vector: list[float]  # Embedding vector
+        └── chunk_id: str        # Source chunk (provenance)
+```
+
+### Document Embeddings
+
+Converts chunk text directly into vector embeddings.
+
+**Process:**
+1. Receive Chunk
+2. Call embeddings service with chunk text
+3. Output DocumentEmbeddings
+
+**Schema: DocumentEmbeddings**
+
+```
+DocumentEmbeddings
+├── metadata: Metadata
+└── chunks: list[ChunkEmbeddings]
+    └── ChunkEmbeddings
+        ├── chunk_id: str        # Chunk identifier
+        └── vector: list[float]  # Embedding vector
+```
+
+### Row Embeddings
+
+Converts row index fields into vector embeddings.
+
+**Process:**
+1. Receive Rows
+2. Embed configured index fields
+3. Output to row vector store
+
+## Stage 6: Storage
+
+### Triple Store
+
+- Receives: Triples
+- Storage: Cassandra (entity-centric tables)
+- Named graphs separate core knowledge from provenance:
+  - `""` (default): Core knowledge facts
+  - `urn:graph:source`: Extraction provenance
+  - `urn:graph:retrieval`: Query-time explainability
+
+### Vector Store (Graph Embeddings)
+
+- Receives: GraphEmbeddings
+- Storage: Qdrant, Milvus, or Pinecone
+- Indexed by: entity IRI
+- Metadata: chunk_id for provenance
+
+### Vector Store (Document Embeddings)
+
+- Receives: DocumentEmbeddings
+- Storage: Qdrant, Milvus, or Pinecone
+- Indexed by: chunk_id
+
+### Row Store
+
+- Receives: Rows
+- Storage: Cassandra
+- Schema-driven table structure
+
+### Row Vector Store
+
+- Receives: Row embeddings
+- Storage: Vector DB
+- Indexed by: row index fields
+
+## Metadata Field Analysis
+
+### Actively Used Fields
+
+| Field | Usage |
+|-------|-------|
+| `metadata.id` | Document/chunk identifier, logging, provenance |
+| `metadata.user` | Multi-tenancy, storage routing |
+| `metadata.collection` | Target collection selection |
+| `document_id` | Librarian reference, provenance linking |
+| `chunk_id` | Provenance tracking through pipeline |
+
+### Potentially Redundant Fields
+
+| Field | Status |
+|-------|--------|
+| `metadata.metadata` | Set to `[]` by all extractors; document-level metadata now handled by librarian at submission time |
+
+### Bytes Fields Pattern
+
+All content fields (`data`, `text`, `chunk`) are `bytes` but immediately decoded to UTF-8 strings by all processors. No processor uses raw bytes.
+
+## Flow Configuration
+
+Flows are defined externally and provided to librarian via config service. Each flow specifies:
+
+- Input queues (`text-load`, `document-load`)
+- Processor chain
+- Parameters (chunk size, extraction method, etc.)
+
+Example flow patterns:
+- `pdf-graphrag`: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
+- `text-graphrag`: Text → Chunker → Definitions + Relationships → Embeddings
+- `pdf-ontology`: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
+- `text-rows`: Text → Chunker → Row Extraction → Row Store