mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 08:26:21 +02:00

Alex Jenkins 8954fa3ad7 Feat: TrustGraph i18n & Documentation Translation Updates (#781)

Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.

Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.

2026-04-14 12:08:32 +01:00


layout	title	parent
default	Data Extraction Flow	Chinese (Beta)

Data Extraction Flow

Beta Translation: This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.

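This document describes how data flows through the TrustGraph extraction pipeline, from document submission to storage in the knowledge base.

Overview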

┌──────────┐     ┌─────────────┐     ┌─────────┐     ┌────────────────────┐
│ Librarian│────▶│ PDF Decoder │────▶│ Chunker │────▶│ Knowledge          │
│          │     │ (PDF only)  │     │         │     │ Extraction         │
│          │────────────────────────▶│         │     │                    │
└──────────┘     └─────────────┘     └─────────┘     └────────────────────┘
                                          │                    │
                                          │                    ├──▶ Triples
                                          │                    ├──▶ Entity Contexts
                                          │                    └──▶ Rows
                                          │
                                          └──▶ Document Embeddings

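Content Storage

Object Storage (S3/Minio)

Document content is stored in S3-compatible object storage:
Path format: doc/{object_id}, where object_id is a UUID
All document types are stored here: source documents, pages, and chunks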

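Metadata Storage (Cassandra)

Document metadata is stored in Cassandra and includes:
Document ID, title, and type (MIME type)
object_id, which references the object store
parent_id, for child documents (pages, chunks)
document_type: "source", "page", "chunk", "answer"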

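Inline vs. Streaming Threshold

Content transfer uses a size-based strategy:
< 2MB: content is included inline in the message (base64-encoded)
≥ 2MB: only the document_id is sent; the processor fetches the content through the librarian API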
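The size-based strategy above can be sketched as a small helper. This is an illustration only, not TrustGraph's code: the function name build_payload, the dictionary shape, and the reading of "2MB" as 2 MiB are all assumptions.

```python
import base64

# Assumption: "2MB" is read here as 2 MiB; the source does not say which.
INLINE_LIMIT = 2 * 1024 * 1024


def build_payload(document_id: str, content: bytes) -> dict:
    """Return a hypothetical message body for a document.

    Content below the limit travels inline as base64 text; anything at or
    above the limit is replaced by a reference the processor resolves
    through the librarian.
    """
    if len(content) < INLINE_LIMIT:
        return {"data": base64.b64encode(content).decode("ascii")}
    return {"document_id": document_id}
```

A processor receiving such a message would check for inline data first and fall back to a librarian fetch when only document_id is present.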

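Stage 1: Document Submission (Librarian)

Entry Point

Documents enter the system through the librarian's add-document operation:

Content is uploaded to object storage
A metadata record is created in Cassandra
The document ID is returned

Triggering Extraction

The add-processing operation triggers extraction:
It specifies the document_id, flow (pipeline ID), and collection (target store)
The librarian's load_document() fetches the content and publishes it to the flow's input queue

Schema: Document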

Document
├── metadata: Metadata
│   ├── id: str              # Document identifier
│   ├── user: str            # Tenant/user ID
│   ├── collection: str      # Target collection
│   └── metadata: list[Triple]  # (largely unused, historical)
├── data: bytes              # PDF content (base64, if inline)
└── document_id: str         # Librarian reference (if streaming)

Routing: based on the kind field:
application/pdf → document-load queue → PDF Decoder
text/plain → text-load queue → Chunker
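The routing rule amounts to a lookup from MIME type to input queue. The sketch below is a minimal illustration; the function name and the choice to raise on any other kind are assumptions, since the source does not say how unlisted kinds are handled.

```python
# Queue names are taken from the routing rule above.
ROUTES = {
    "application/pdf": "document-load",  # goes to the PDF decoder
    "text/plain": "text-load",           # goes straight to the chunker
}


def input_queue_for(kind: str) -> str:
    """Return the flow input queue for a document kind (MIME type)."""
    try:
        return ROUTES[kind]
    except KeyError:
        # Assumption: unlisted kinds are rejected rather than defaulted.
        raise ValueError(f"unsupported kind: {kind}") from None
```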

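Stage 2: PDF Decoder

Converts PDF documents into text pages.

Process

Fetch the content (inline data, or from the librarian via document_id)
Extract pages using PyPDF
For each page: save it as a child document in the librarian ({doc_id}/p{page_num}), emit provenance triples (page derived from document), and forward it to the chunker

Schema: TextDocument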

TextDocument
├── metadata: Metadata
│   ├── id: str              # Page URI (e.g., https://trustgraph.ai/doc/xxx/p1)
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── text: bytes              # Page text content (if inline)
└── document_id: str         # Librarian reference (e.g., "doc123/p1")

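Stage 3: Chunker

Splits text into chunks of a configured size.

Parameters (configurable)

chunk_size: target chunk size in characters (default: 2000)
chunk_overlap: overlap between chunks (default: 100)

Process

Fetch the text content (inline or via the librarian)
Split it using a recursive character splitter
For each chunk: save it as a child document in the librarian ({parent_id}/c{index}), emit provenance triples (chunk derived from page/document), and forward it to the extraction processors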
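To show how chunk_size and chunk_overlap interact, here is a deliberately simplified fixed-window splitter. It is not the recursive character splitter the chunker uses (which also tries to break on natural separators); it only illustrates the size and overlap arithmetic, and the function name is hypothetical.

```python
def split_with_overlap(
    text: str, chunk_size: int = 2000, chunk_overlap: int = 100
) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, a 4500-character text yields three chunks starting at offsets 0, 1900, and 3800.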

Schema: Chunk

Chunk
├── metadata: Metadata
│   ├── id: str              # Chunk URI
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]
├── chunk: bytes             # Chunk text content
└── document_id: str         # Librarian chunk ID (e.g., "doc123/p1/c3")

Document ID Hierarchy

Child documents encode their provenance in their IDs:
Source: doc123
Page: doc123/p5
Chunk from a page: doc123/p5/c2
Chunk from text: doc123/c2
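Because the ID carries the lineage, a consumer can recover the source, page, and chunk index by parsing it. The following is a sketch under the assumption that source IDs contain no "/" and that page and chunk segments always look like p<number> and c<number>; the names DocumentRef, parse_document_id, and parent_id are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DocumentRef(NamedTuple):
    source: str
    page: Optional[int]
    chunk: Optional[int]


# source, then an optional /p<n> segment, then an optional /c<n> segment
_ID_PATTERN = re.compile(
    r"^(?P<source>[^/]+)(?:/p(?P<page>\d+))?(?:/c(?P<chunk>\d+))?$"
)


def parse_document_id(document_id: str) -> DocumentRef:
    """Split a librarian document ID into source, page, and chunk parts."""
    match = _ID_PATTERN.match(document_id)
    if match is None:
        raise ValueError(f"unrecognised document ID: {document_id}")
    page = match.group("page")
    chunk = match.group("chunk")
    return DocumentRef(
        source=match.group("source"),
        page=int(page) if page is not None else None,
        chunk=int(chunk) if chunk is not None else None,
    )


def parent_id(document_id: str) -> Optional[str]:
    """Return the ID one level up, or None for a source document."""
    head, sep, _ = document_id.rpartition("/")
    return head if sep else None
```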

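Stage 4: Knowledge Extraction

Several extraction patterns are available, selected by the flow configuration.

Pattern A: Basic GraphRAG

Two processors run in parallel:

kg-extract-definitions
Input: Chunk
Output: Triples (entity definitions), EntityContexts
Extracts: entity labels and definitions

kg-extract-relationships
Input: Chunk
Output: Triples (relationships), EntityContexts
Extracts: subject-predicate-object relationships

Pattern B: Ontology-Based (kg-extract-ontology)

Input: Chunk
Output: Triples, EntityContexts
Uses the configured ontology to guide extraction

Pattern C: Agent-Based (kg-extract-agent)

Input: Chunk
Output: Triples, EntityContexts
Uses an agent framework for extraction

Pattern D: Row Extraction (kg-extract-rows)

Input: Chunk
Output: Rows (structured data, not triples)
Uses schema definitions to extract structured records

Schema: Triples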

Triples
├── metadata: Metadata
│   ├── id: str
│   ├── user: str
│   ├── collection: str
│   └── metadata: list[Triple]  # (set to [] by extractors)
└── triples: list[Triple]
    └── Triple
        ├── s: Term              # Subject
        ├── p: Term              # Predicate
        ├── o: Term              # Object
        └── g: str | None        # Named graph
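For readers who prefer code to tree diagrams, the Triples schema above can be mirrored with plain dataclasses. This is a sketch, not TrustGraph's actual class definitions: the internal shape of Term is not given in this document, so the single value field is an assumption, and the example IRIs are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Term:
    value: str  # assumption: the real Term likely carries more than a string


@dataclass(frozen=True)
class Triple:
    s: Term                  # subject
    p: Term                  # predicate
    o: Term                  # object
    g: Optional[str] = None  # named graph


@dataclass
class Metadata:
    id: str
    user: str
    collection: str
    metadata: list[Triple] = field(default_factory=list)


@dataclass
class Triples:
    metadata: Metadata
    triples: list[Triple] = field(default_factory=list)


example = Triples(
    metadata=Metadata(
        id="https://example.com/doc/doc123/p1/c3",  # placeholder chunk URI
        user="alice",
        collection="default",
    ),
    triples=[
        Triple(
            s=Term("https://example.com/e/cat"),
            p=Term("http://www.w3.org/2000/01/rdf-schema#label"),
            o=Term("cat"),
        )
    ],
)
```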

Schema: EntityContexts

EntityContexts
├── metadata: Metadata
└── entities: list[EntityContext]
    └── EntityContext
        ├── entity: Term         # Entity identifier (IRI)
        ├── context: str         # Textual description for embedding
        └── chunk_id: str        # Source chunk ID (provenance)

Schema: Rows

Rows
├── metadata: Metadata
├── row_schema: RowSchema
│   ├── name: str
│   ├── description: str
│   └── fields: list[Field]
└── rows: list[dict[str, str]]   # Extracted records
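Since each row is a dict keyed by field name, one simple consistency check is that rows only use keys declared in row_schema. The sketch below assumes Field has a name attribute (the tree above does not list Field's attributes) and uses a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str  # assumption: Field's attributes are not shown in the schema


@dataclass(frozen=True)
class RowSchema:
    name: str
    description: str
    fields: tuple[Field, ...]


def undeclared_keys(schema: RowSchema, rows: list[dict[str, str]]) -> list[list[str]]:
    """For each row, list the keys that the schema does not declare."""
    declared = {f.name for f in schema.fields}
    return [sorted(set(row) - declared) for row in rows]
```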

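Stage 5: Embedding Generation

Graph Embeddings

Converts entity contexts into vector embeddings.

Process:

Receive EntityContexts
Call the embedding service with the context text
Output GraphEmbeddings (entity → vector mapping)

Schema: GraphEmbeddings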

GraphEmbeddings
├── metadata: Metadata
└── entities: list[EntityEmbeddings]
    └── EntityEmbeddings
        ├── entity: Term         # Entity identifier
        ├── vector: list[float]  # Embedding vector
        └── chunk_id: str        # Source chunk (provenance)
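The graph-embedding step is essentially a map over entity contexts that keeps entity and chunk_id and swaps the context text for a vector. In this sketch the embedding service is stood in for by any callable from text to a list of floats; toy_embed is a placeholder, not a real model, and entities are simplified to plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EntityContext:
    entity: str    # simplified: the schema uses a Term here
    context: str
    chunk_id: str


@dataclass
class EntityEmbeddings:
    entity: str
    vector: list[float]
    chunk_id: str


def embed_entities(
    contexts: Sequence[EntityContext],
    embed: Callable[[str], list[float]],
) -> list[EntityEmbeddings]:
    """Embed each context text, carrying entity and chunk_id through."""
    return [
        EntityEmbeddings(entity=c.entity, vector=embed(c.context), chunk_id=c.chunk_id)
        for c in contexts
    ]


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding: text length and space count. Not a real model."""
    return [float(len(text)), float(text.count(" "))]
```

Carrying chunk_id through unchanged is what lets a vector hit be traced back to the chunk it came from.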

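Document Embeddings

Converts text chunks directly into vector embeddings.

Process:

Receive a Chunk
Call the embedding service with the chunk text
Output DocumentEmbeddings

Schema: DocumentEmbeddings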

DocumentEmbeddings
├── metadata: Metadata
└── chunks: list[ChunkEmbeddings]
    └── ChunkEmbeddings
        ├── chunk_id: str        # Chunk identifier
        └── vector: list[float]  # Embedding vector

Row Embeddings

Converts row index fields into vector embeddings.

Process:

Receive Rows
Embed the configured index fields
Output to the row vector store

Stage 6: Storage

Triple Store

Receives: Triples
Storage: Cassandra (entity-centric tables)
Named graphs separate core knowledge from provenance:
"" (default): core knowledge facts
urn:graph:source: extraction provenance
urn:graph:retrieval: query-time explainability
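To illustrate how named graphs keep knowledge and provenance apart, the sketch below groups (s, p, o, g) tuples by graph. Treating a missing graph (None) as the default graph "" is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Optional

DEFAULT_GRAPH = ""                         # core knowledge facts
SOURCE_GRAPH = "urn:graph:source"          # extraction provenance
RETRIEVAL_GRAPH = "urn:graph:retrieval"    # query-time explainability

Quad = tuple[str, str, str, Optional[str]]


def group_by_graph(quads: list[Quad]) -> dict[str, list[tuple[str, str, str]]]:
    """Group triples by named graph; None is treated as the default graph."""
    groups: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
    for s, p, o, g in quads:
        groups[g if g is not None else DEFAULT_GRAPH].append((s, p, o))
    return dict(groups)
```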

Vector Store (Graph Embeddings)

Receives: GraphEmbeddings
Storage: Qdrant, Milvus, or Pinecone
Indexed by: entity IRI
Metadata: chunk_id for provenance

Vector Store (Document Embeddings)

Receives: DocumentEmbeddings
Storage: Qdrant, Milvus, or Pinecone
Indexed by: chunk_id

Row Store

Receives: Rows
Storage: Cassandra
Schema-driven table structure

Row Vector Store

Receives: row embeddings
Storage: vector database
Indexed by: row index fields

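Metadata Field Analysis

Fields in Use

Field	Purpose
`metadata.id`	Document/chunk identifier, logging, provenance
`metadata.user`	Multi-tenancy, storage routing
`metadata.collection`	Target collection selection
`document_id`	Librarian reference, provenance linking
`chunk_id`	Provenance tracking through the pipeline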

<<<<<<< HEAD

Potentially Redundant Fields

Field	Status
`metadata.metadata`	Set to `[]` by all extractors; document-level metadata is now handled by the librarian at submission time

=======

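Removed Fields

Field	Status
`metadata.metadata`	Removed from the `Metadata` class. Document-level metadata triples are now sent by the librarian directly to the triple store rather than through the extraction pipeline.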

e3bcbf73 (The metadata field (list of triples) in the pipeline Metadata class)

Bytes Field Pattern

All content fields (data, text, chunk) are bytes, but every processor immediately decodes them to UTF-8 strings. No processor uses the raw bytes.
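In practice the pattern is a single decode at the top of each processor. A minimal sketch, with a hypothetical helper name:

```python
def content_text(raw: bytes) -> str:
    """Decode a content field (data, text, or chunk) as UTF-8."""
    return raw.decode("utf-8")


# Non-ASCII content survives the bytes round trip.
chunk_field = "知识图谱 knowledge graph".encode("utf-8")
text = content_text(chunk_field)
```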

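Flow Configuration

Flows are defined externally and provided to the librarian through the configuration service. Each flow specifies:

Input queue (text-load, document-load)
Processor chain
Parameters (chunk size, extraction method, etc.)

Example flow patterns:
pdf-graphrag: PDF → Decoder → Chunker → Definitions + Relationships → Embeddings
text-graphrag: Text → Chunker → Definitions + Relationships → Embeddings
pdf-ontology: PDF → Decoder → Chunker → Ontology Extraction → Embeddings
text-rows: Text → Chunker → Row Extraction → Row Store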
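The example patterns above can be written down as data: an input queue plus an ordered processor chain. This is only an illustrative structure, not the format the configuration service uses; the stage labels are shorthand for the processors named in this document, and the queue per pattern follows the routing rule (PDF to document-load, text to text-load).

```python
# Hypothetical in-memory view of the example flow patterns.
FLOW_PATTERNS = {
    "pdf-graphrag": (
        "document-load",
        ["decoder", "chunker", "definitions + relationships", "embeddings"],
    ),
    "text-graphrag": (
        "text-load",
        ["chunker", "definitions + relationships", "embeddings"],
    ),
    "pdf-ontology": (
        "document-load",
        ["decoder", "chunker", "ontology extraction", "embeddings"],
    ),
    "text-rows": (
        "text-load",
        ["chunker", "row extraction", "row store"],
    ),
}


def describe_flow(name: str) -> str:
    """Render a flow pattern as 'queue: stage → stage → ...'."""
    queue, chain = FLOW_PATTERNS[name]
    return f"{queue}: " + " → ".join(chain)
```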
