mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 08:26:21 +02:00

Alex Jenkins 8954fa3ad7 Feat: TrustGraph i18n & Documentation Translation Updates (#781 )

Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.

Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.

2026-04-14 12:08:32 +01:00


layout	title	parent
default	Ontology Knowledge Extraction - Phase 2 Refactor	Chinese (Beta)

Ontology Knowledge Extraction - Phase 2 Refactor

Beta Translation: This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.

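Status: Draft
Author: Analysis session 2025-12-03
Related: ontology.md, ontorag.md

Overview

This document identifies gaps in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.

Current Implementation

How It Works Today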

Ontology loading (ontology_loader.py)
- Loads the ontology JSON file, whose keys look like "fo/Recipe", "fo/Food", "fo/produces"
- Class IDs carry the namespace prefix in the key itself
- Example from food.ontology:
```
"classes": {
  "fo/Recipe": {
    "uri": "http://purl.org/ontology/fo/Recipe",
    "rdfs:comment": "A Recipe is a combination..."
  }
}
```
Prompt construction (extract.py:299-307, ontology-prompt.md)
- The template receives the classes, object_properties, and datatype_properties dicts
- Template loop: {% for class_id, class_def in classes.items() %}
- The LLM sees: **fo/Recipe**: A Recipe is a combination...
- The example output format is:
```
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
{"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
```
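Response parsing (extract.py:382-428)
- Expects a JSON array: [{"subject": "...", "predicate": "...", "object": "..."}]
- Validates the result against the ontology subset
- Expands URIs via expand_uri() (extract.py:473-521)

URI expansion (extract.py:473-521)
- Checks whether the value is a key in the ontology_subset.classes dict
- If found, takes the URI from the class definition
- If not found, constructs a URI: f"https://trustgraph.ai/ontology/{ontology_id}#{value}"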

Data Flow Example

Ontology JSON → Loader → Prompt:

"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"

LLM → Parser → Output:

"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI

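Problems Identified

1. Inconsistent Examples in the Prompt

Problem: The prompt template lists class IDs with their prefix (fo/Recipe), but the example output uses unprefixed class names (Recipe).

Location: ontology-prompt.md:5-52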

## Ontology Classes:
- **fo/Recipe**: A Recipe is...

## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}

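Impact: The LLM receives conflicting signals about which format to use.

2. Information Loss in URI Expansion

Problem: When the LLM returns unprefixed class names, as in the prompt example, expand_uri() cannot find them in the ontology dict and constructs a fallback URI instead, losing the original, correct URI.

Location: extract.py:494-500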

if value in ontology_subset.classes:  # Looks for "Recipe"
    class_def = ontology_subset.classes[value]  # But key is "fo/Recipe"
    if isinstance(class_def, dict) and 'uri' in class_def:
        return class_def['uri']  # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"  # Fallback

Impact:
- Original URI: http://purl.org/ontology/fo/Recipe
- Constructed URI: https://trustgraph.ai/ontology/food#Recipe
- The semantic link to the source ontology is lost, which breaks interoperability
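The loss can be reproduced in isolation. The following is a minimal sketch, not the project's code: `expand_uri` here is a simplified stand-in that covers only the class-lookup branch quoted above, and its signature is an assumption made for illustration.

```python
def expand_uri(value: str, classes: dict, ontology_id: str) -> str:
    """Simplified stand-in for the class-lookup branch of expand_uri()."""
    class_def = classes.get(value)
    if isinstance(class_def, dict) and "uri" in class_def:
        return class_def["uri"]
    # Fallback: synthesize a URI under the trustgraph.ai namespace
    return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"


classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"}}

# Prefixed ID matches the dict key, so the original URI survives
preserved = expand_uri("fo/Recipe", classes, "food")

# Unprefixed name misses the key, so a fallback URI is constructed
lost = expand_uri("Recipe", classes, "food")
```

Here `preserved` is the Food Ontology URI, while `lost` points into the fallback namespace and no longer identifies the same class for other RDF consumers.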

3. Ambiguous Entity Instance Format

Problem: There is no clear guidance on the format of entity instance URIs.

Examples in the prompt:
- "recipe:cornish-pasty" (a namespace-like prefix)
- "ingredient:flour" (a different prefix)

Actual behavior (extract.py:517-520):

# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"

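Impact: The LLM has to guess the prefix convention with no ontology context to guide it.

4. No Guidance on Namespace Prefixes

Problem: The ontology JSON contains namespace definitions (lines 10-25 of food.ontology):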

"namespaces": {
  "fo": "http://purl.org/ontology/fo/",
  "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  ...
}

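However, these are never passed to the LLM. The LLM does not know:
- What "fo" means
- Which prefix to use for entities
- Which namespace applies to which elements

5. Labels Not Used in the Prompt

Problem: Every class has an rdfs:label field (e.g., {"value": "Recipe", "lang": "en-gb"}), but the prompt template does not use it.

Current: Only class_id and comment are shown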

- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}

Available but unused:

"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]

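Impact: Human-readable names could be shown alongside the technical IDs, but currently are not.

Proposed Solutions

Option A: Normalize to Unprefixed IDs

Approach: Strip the prefix from class IDs before showing them to the LLM.

Changes:

Modify build_extraction_variables() to transform the keys: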

classes_for_prompt = {
    k.split('/')[-1]: v  # "fo/Recipe" → "Recipe"
    for k, v in ontology_subset.classes.items()
}

Update the prompt examples to match (they already use unprefixed names).

Modify expand_uri() to handle both formats:

# Try exact match first
if value in ontology_subset.classes:
    return ontology_subset.classes[value]['uri']

# Try with prefix
for prefix in ['fo/', 'rdf:', 'rdfs:']:
    prefixed = f"{prefix}{value}"
    if prefixed in ontology_subset.classes:
        return ontology_subset.classes[prefixed]['uri']

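Pros:
- Cleaner and easier for humans to read
- Matches the existing prompt examples
- LLMs tend to work better with simpler tokens

Cons:
- Class names can collide if multiple ontologies define the same class name
- Namespace information is lost
- Lookups need fallback logic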
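The collision risk is easy to demonstrate. This is a minimal sketch assuming the classes of two ontologies have been merged into one dict; `strip_prefixes` and the `cooking/Recipe` entry (with its example.org URI) are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def strip_prefixes(classes: dict) -> tuple[dict, dict]:
    """Return (unprefixed classes, collisions) for a merged class dict."""
    seen = defaultdict(list)
    for class_id in classes:
        # "fo/Recipe" -> "Recipe"
        seen[class_id.split("/")[-1]].append(class_id)
    # Keep only short names that map to exactly one prefixed ID
    stripped = {s: classes[ids[0]] for s, ids in seen.items() if len(ids) == 1}
    # Short names claimed by more than one prefixed ID are ambiguous
    collisions = {s: ids for s, ids in seen.items() if len(ids) > 1}
    return stripped, collisions


merged = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
    "fo/Food": {"uri": "http://purl.org/ontology/fo/Food"},
    # Hypothetical second ontology that also defines a Recipe class
    "cooking/Recipe": {"uri": "http://example.org/cooking/Recipe"},
}

stripped, collisions = strip_prefixes(merged)
```

With this input only "Food" can be stripped safely; "Recipe" is reported as a collision between `fo/Recipe` and `cooking/Recipe`, which is the case Option A cannot resolve without extra logic.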

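Option B: Always Use Full Prefixed IDs

Approach: Update the examples to use prefixed IDs that match those shown in the class list.

Changes:

Update the prompt example (ontology-prompt.md:46-52):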

[
  {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
  {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
  {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
  {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
]

Add a namespace explanation to the prompt:

## Namespace Prefixes:
- **fo/**: Food Ontology (http://purl.org/ontology/fo/)
- **rdf:**: RDF
- **rdfs:**: RDF Schema

Use these prefixes exactly as shown when referencing classes and properties.

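Keep expand_uri() as is (it works correctly when a match is found).

Pros:
- Input = output consistency
- No information loss
- Preserves namespace semantics
- Works with multiple ontologies

Cons:
- More verbose tokens for the LLM
- Requires the LLM to keep track of prefixes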
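The namespace section could be generated from the ontology's own "namespaces" block rather than written by hand. The following is a rough sketch under that assumption; `render_namespace_section` is a hypothetical helper, and it prints the bare prefix, leaving the choice of separator style (fo/ versus rdf:) to the prompt author.

```python
def render_namespace_section(namespaces: dict) -> str:
    """Render a prompt section listing each prefix and its namespace URI."""
    lines = ["## Namespace Prefixes:"]
    for prefix, uri in namespaces.items():
        lines.append(f"- **{prefix}**: {uri}")
    lines.append("")
    lines.append(
        "Use these prefixes exactly as shown when referencing classes and properties."
    )
    return "\n".join(lines)


# Subset of the "namespaces" block from food.ontology
namespaces = {
    "fo": "http://purl.org/ontology/fo/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

section = render_namespace_section(namespaces)
```

Generating the section this way keeps the prompt in step with whatever ontology is loaded, at the cost of a few more prompt tokens per namespace.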

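Option C: Hybrid - Show Both Label and ID

Approach: Enhance the prompt to show both the human-readable label and the technical ID.

Changes:

Update the prompt template: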

{% for class_id, class_def in classes.items() %}
- **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
{% endfor %}

Example output:

- **fo/Recipe** (label: "Recipe"): A Recipe is a combination...

Update the instructions:

When referencing classes:
- Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
- The label (e.g., "Recipe") is for human understanding only

Pros:
- Clearest for the LLM
- Preserves all information
- States explicitly which identifier to use

Cons:
- Longer prompt
- More complex template
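The label fallback in the template can be expressed in plain Python to make the rule explicit. This is a sketch only: `class_line` is a hypothetical helper, and it assumes the class definition is the raw ontology JSON with "rdfs:label" and "rdfs:comment" keys rather than whatever object the real template receives.

```python
def class_line(class_id: str, class_def: dict) -> str:
    """Render one class entry, falling back to the ID when no label exists."""
    labels = class_def.get("rdfs:label") or []
    label = labels[0]["value"] if labels else class_id
    line = f'- **{class_id}** (label: "{label}")'
    comment = class_def.get("rdfs:comment")
    return f"{line}: {comment}" if comment else line


with_label = class_line(
    "fo/Recipe",
    {
        "rdfs:label": [{"value": "Recipe", "lang": "en-gb"}],
        "rdfs:comment": "A Recipe is a combination...",
    },
)

# No label and no comment: the ID doubles as the label
without_label = class_line("fo/Food", {})
```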

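Implemented Approach

Simplified entity-relationship-attribute format - completely replaces the old triple-based format.

This new approach was chosen because:

- No information loss: original URIs are preserved correctly
- Simpler logic: no transformation is needed, direct dictionary lookups work
- Namespace safety: handles multiple ontologies without collisions
- Semantic correctness: maintains RDF/OWL semantics

Implementation Complete

What Was Built:

New prompt template (prompts/ontology-extract-v2.txt)
- ✅ Clear sections: Entity Types, Relationships, Attributes
- ✅ Examples using full type identifiers (fo/Recipe, fo/has_ingredient)
- ✅ Instructions to use the exact identifiers from the schema
- ✅ New JSON format with entities/relationships/attributes arrays

Entity normalization (entity_normalizer.py)
- ✅ normalize_entity_name() - converts names to a URI-safe format
- ✅ normalize_type_identifier() - handles slashes in types (fo/Recipe → fo-recipe)
- ✅ build_entity_uri() - creates unique URIs from (name, type) tuples
- ✅ EntityRegistry - tracks entities for deduplication

JSON parser (simplified_parser.py)
- ✅ Parses the new format: {entities: [...], relationships: [...], attributes: [...]}
- ✅ Supports kebab-case and snake_case field names
- ✅ Returns structured dataclasses
- ✅ Graceful error handling and logging

Triple converter (triple_converter.py)
- ✅ convert_entity() - generates type + label triples automatically
- ✅ convert_relationship() - connects entity URIs through a property
- ✅ convert_attribute() - adds literal values
- ✅ Looks up full URIs from the ontology definitions

Updated main processor (extract.py)
- ✅ Removed the old triple-based extraction code
- ✅ Added the extract_with_simplified_format() method
- ✅ Now uses only the new simplified format
- ✅ Calls the prompt with the extract-with-ontologies-v2 ID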
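To illustrate the parser behaviour listed above (kebab-case and snake_case field names, dataclass results, tolerant error handling), here is a minimal sketch for the relationships array only. The `Relationship` dataclass and `parse_relationships` function are assumptions for illustration and are not taken from simplified_parser.py.

```python
import json
from dataclasses import dataclass


@dataclass
class Relationship:
    subject: str
    subject_type: str
    relation: str
    object: str
    object_type: str


def _snake(record: dict) -> dict:
    """Map kebab-case keys to snake_case so both spellings are accepted."""
    return {key.replace("-", "_"): value for key, value in record.items()}


def parse_relationships(raw: str) -> list[Relationship]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # a real parser would also log the failure
    if not isinstance(data, dict):
        return []
    parsed = []
    for record in data.get("relationships") or []:
        if not isinstance(record, dict):
            continue
        try:
            parsed.append(Relationship(**_snake(record)))
        except TypeError:
            continue  # skip records with missing or unexpected fields
    return parsed


raw = json.dumps({"relationships": [
    {"subject": "Cornish pasty", "subject-type": "Recipe",
     "relation": "has_ingredient", "object": "beef", "object-type": "Food"},
    {"subject": "Cornish pasty", "subject_type": "Recipe",
     "relation": "has_ingredient", "object": "potatoes", "object_type": "Food"},
    {"subject": "incomplete record"},
]})
relationships = parse_relationships(raw)
```

The malformed third record is dropped rather than aborting the whole response, so one bad item from the LLM does not discard the valid ones.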

Test Cases

Test 1: URI Preservation

# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}

# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}

# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"

Test 2: Multi-Ontology Collision

# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}

# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"}  # Not just "Recipe"

Test 3: Entity Instance Format

# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"}  # Namespace-style
{"subject": "food:beef"}              # Consistent prefix

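Open Questions

Should entity instances use namespace prefixes?
- Current: "recipe:cornish-pasty" (arbitrary)
- Alternative: use the ontology prefix, "fo:cornish-pasty"?
- Alternative: no prefix, and expand "cornish-pasty" to a full URI in code?

How should domain/range be shown in the prompt?
- Currently shown as: (Recipe → Food)
- Should it be: (fo/Recipe → fo/Food)?

Should domain/range constraints be validated?
- There is a TODO comment at extract.py:470
- Validation would catch more errors, but adds complexity

What about inverse properties and equivalences?
- The ontology has owl:inverseOf and owl:equivalentClass
- Neither is currently used in extraction
- Should they be?

Success Metrics

- ✅ Zero URI information loss (100% of original URIs preserved)
- ✅ LLM output format matches the input format
- ✅ No ambiguous examples in the prompt
- ✅ Tests pass with multiple ontologies
- ✅ Improved extraction quality (measured by the percentage of valid triples)

Alternative Approach: Simplified Extraction Format

Philosophy

Instead of asking the LLM to understand RDF/OWL semantics, have it do what it is good at: finding entities and relationships in text.

Let the code handle URI construction, RDF conversion, and Semantic Web conventions.

Example: Entity Classification

Input text: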

Cornish pasty is a traditional British pastry filled with meat and vegetables.

Ontology schema (shown to the LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

What the LLM returns (simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    }
  ]
}

What the code produces (RDF triples):

# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty"  # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"

# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty

# 2. Generate triples
triples = [
    # Type triple
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),
    # Label triple (automatic)
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
        o=Value(value="Cornish pasty", is_uri=False)
    )
]

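Benefits

The LLM does not need to:
- Understand URI syntax
- Invent identifier prefixes (recipe:, ingredient:)
- Know about rdf:type or rdfs:label
- Construct Semantic Web identifiers

The LLM only needs to:
- Find entities in the text
- Map them to ontology classes
- Extract relationships and attributes

The code handles:
- URI normalization and construction
- RDF triple generation
- Automatic label assignment
- Namespace management

Why This Is Better

- Simpler prompt = less confusion = fewer errors
- Consistent IDs = the code controls the normalization rules
- Auto-generated labels = no missing rdfs:label triples
- The LLM focuses on extraction = what it is actually good at

Example: Entity Relationships

Input text: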

Cornish pasty is a traditional British pastry filled with beef and potatoes.

Ontology schema (shown to the LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)

What the LLM returns (simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ]
}

What the code produces (RDF triples):

# Normalize (name, type) pairs to URIs; the type prefix prevents collisions
cornish_pasty_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
beef_uri = "https://trustgraph.ai/food/food-beef"
potatoes_uri = "https://trustgraph.ai/food/food-potatoes"

# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients"  # from fo/has_ingredient

triples = [
    # Entity type triples (as before)
    Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
    Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),

    Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=beef_uri, p=rdfs_label, o="beef"),

    Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),

    # Relationship triples
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=beef_uri, is_uri=True)
    ),
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=potatoes_uri, is_uri=True)
    )
]

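Key points:
- The LLM returns natural-language entity names: "Cornish pasty", "beef", "potatoes"
- The LLM includes type information to disambiguate: subject-type, object-type
- The LLM uses relation names from the schema: "has_ingredient"
- The code derives consistent IDs from (name, type): ("Cornish pasty", "Recipe") → recipe-cornish-pasty
- The code looks up the relation URI in the ontology: fo/has_ingredient → full URI
- The same (name, type) tuple always gets the same URI (deduplication)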
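The conversion step summarized in these points can be sketched as follows. This is an illustration under stated assumptions, not the code in triple_converter.py: the base URI follows the examples in this document, the slug rule in `entity_uri` is a guess at the normalization, and the property dict is keyed by the unprefixed relation name shown to the LLM.

```python
import re

BASE = "https://trustgraph.ai/food/"  # assumed base, following the examples


def entity_uri(name: str, type_id: str) -> str:
    """Derive a URI from (name, type); the slug rule is an assumption."""
    slug = re.sub(r"[^a-z0-9]+", "-", f"{type_id} {name}".lower()).strip("-")
    return BASE + slug


def convert_relationship(rel: dict, object_properties: dict):
    """Return an (s, p, o) URI tuple, or None if the relation is unknown."""
    prop = object_properties.get(rel["relation"])
    if prop is None:
        return None  # relation is not in the ontology subset: drop it
    return (
        entity_uri(rel["subject"], rel["subject-type"]),
        prop["uri"],
        entity_uri(rel["object"], rel["object-type"]),
    )


object_properties = {
    "has_ingredient": {"uri": "http://purl.org/ontology/fo/ingredients"},
}
rel = {
    "subject": "Cornish pasty", "subject-type": "Recipe",
    "relation": "has_ingredient",
    "object": "beef", "object-type": "Food",
}
triple = convert_relationship(rel, object_properties)
```

Because the predicate URI comes from the ontology dict and never from the LLM, a relation name the ontology does not define simply yields no triple.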

Example: Entity Name Disambiguation

Problem: The same name can refer to different entity types.

Real-world case:

"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)

How it is handled:

The LLM returns both as separate entities:

{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "Cornish pasty", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "produces",
      "object": "Cornish pasty",
      "object-type": "Food"
    }
  ]
}

How the code resolves them:

# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"

food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"

# Relationship connects them correctly
triple = Triple(
    s=recipe_uri,  # The Recipe
    p="http://purl.org/ontology/fo/produces",
    o=food_uri     # The Food
)

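Why this works:
- The type is included in every reference (entities, relationships, attributes)
- The code uses the (name, type) tuple as the lookup key
- No ambiguity, no collisions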
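A registry keyed on (name, type) is one way to realize this. The sketch below reuses the function and class names mentioned earlier in this document (normalize_entity_name, normalize_type_identifier, EntityRegistry), but the bodies, the `uri_for` method, and the exact normalization rules are assumptions made for illustration.

```python
import re


def normalize_entity_name(name: str) -> str:
    """Lowercase and collapse any non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def normalize_type_identifier(type_id: str) -> str:
    """Make a type ID URI-safe: "fo/Recipe" -> "fo-recipe"."""
    return type_id.replace("/", "-").lower()


class EntityRegistry:
    """Hands out one stable URI per normalized (name, type) pair."""

    def __init__(self, base: str):
        self.base = base.rstrip("/") + "/"
        self._uris: dict[tuple[str, str], str] = {}

    def uri_for(self, name: str, type_id: str) -> str:
        key = (normalize_entity_name(name), normalize_type_identifier(type_id))
        if key not in self._uris:
            # Type first, then name, matching recipe-cornish-pasty
            self._uris[key] = f"{self.base}{key[1]}-{key[0]}"
        return self._uris[key]


registry = EntityRegistry("https://trustgraph.ai/food")
recipe_uri = registry.uri_for("Cornish pasty", "Recipe")
food_uri = registry.uri_for("Cornish pasty", "Food")
same_recipe_uri = registry.uri_for("cornish  Pasty", "Recipe")
```

The two "Cornish pasty" entities get distinct URIs because their types differ, while spelling variants of the same (name, type) pair collapse onto one URI.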

Example: Entity Attributes

Input text:

This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.

Ontology schema (shown to the LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method

## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)

What the LLM returns (simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty recipe",
      "type": "Recipe"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4-6 people"
    },
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "preparation_time",
      "value": "45 minutes"
    }
  ]
}

What the code produces (RDF triples):

# Normalize (name, type) to a URI; the type prefix prevents collisions
recipe_uri = "https://trustgraph.ai/food/recipe-cornish-pasty-recipe"

# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves"  # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time"  # from fo/preparation_time

triples = [
    # Entity type triple
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdf_type, is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),

    # Label triple (automatic)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdfs_label, is_uri=True),
        o=Value(value="Cornish pasty recipe", is_uri=False)
    ),

    # Attribute triples (objects are literals, not URIs)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=serves_uri, is_uri=True),
        o=Value(value="4-6 people", is_uri=False)  # Literal value!
    ),
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=prep_time_uri, is_uri=True),
        o=Value(value="45 minutes", is_uri=False)  # Literal value!
    )
]

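Key points:
- The LLM extracts literal values: "4-6 people", "45 minutes"
- The LLM includes the entity type to disambiguate: entity-type
- The LLM uses attribute names from the schema: "serves", "preparation_time"
- The code looks up attribute URIs in the ontology's datatype properties
- Objects are literals (is_uri=False), not URI references
- Values stay as natural text; no normalization is needed

Difference from relationships:
- Relationships: both subject and object are entities (URIs)
- Attributes: the subject is an entity (URI), the object is a literal (string/number)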
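The distinction shows up only in the object position of the triple. As a minimal sketch, assuming a `Value` stand-in that mirrors the Value(value=..., is_uri=...) shape used in the examples and a hypothetical `convert_attribute` signature:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """Stand-in for the Value type used in the examples above."""
    value: str
    is_uri: bool


def convert_attribute(attr: dict, entity_uri: str, datatype_properties: dict):
    """Return an (s, p, o) tuple whose object is a literal, or None."""
    prop = datatype_properties.get(attr["attribute"])
    if prop is None:
        return None  # attribute is not defined in the ontology subset
    return (
        Value(entity_uri, True),
        Value(prop["uri"], True),
        # The object keeps the LLM's natural text and is never a URI
        Value(str(attr["value"]), False),
    )


datatype_properties = {"serves": {"uri": "http://purl.org/ontology/fo/serves"}}
attr = {
    "entity": "Cornish pasty recipe", "entity-type": "Recipe",
    "attribute": "serves", "value": "4-6 people",
}
triple = convert_attribute(
    attr,
    "https://trustgraph.ai/food/recipe-cornish-pasty-recipe",
    datatype_properties,
)
```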

Complete Example: Entities + Relationships + Attributes

Input text:

Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.

What the LLM returns:

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4 people"
    }
  ]
}

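Result: 9 RDF triples generated:
- 3 entity type triples (rdf:type)
- 3 entity label triples (rdfs:label), generated automatically
- 2 relationship triples (has_ingredient)
- 1 attribute triple (serves)

All of this comes from a simple, natural-language extraction by the LLM!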
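The triple tally for a response like the one above follows directly from its shape: one type triple and one label triple per entity, one triple per relationship, and one per attribute. The sketch below counts them under the assumption that every relation and attribute name resolves in the ontology; `count_triples` is a hypothetical helper.

```python
def count_triples(extraction: dict) -> dict:
    """Count the triples a simplified-format response would expand into."""
    entities = extraction.get("entities", [])
    counts = {
        "type": len(entities),    # one rdf:type triple per entity
        "label": len(entities),   # one rdfs:label triple per entity
        "relationship": len(extraction.get("relationships", [])),
        "attribute": len(extraction.get("attributes", [])),
    }
    counts["total"] = sum(counts.values())
    return counts


# Same shape as the complete example: 3 entities, 2 relationships, 1 attribute
extraction = {
    "entities": [
        {"entity": "Cornish pasty", "type": "Recipe"},
        {"entity": "beef", "type": "Food"},
        {"entity": "potatoes", "type": "Food"},
    ],
    "relationships": [
        {"subject": "Cornish pasty", "relation": "has_ingredient", "object": "beef"},
        {"subject": "Cornish pasty", "relation": "has_ingredient", "object": "potatoes"},
    ],
    "attributes": [
        {"entity": "Cornish pasty", "attribute": "serves", "value": "4 people"},
    ],
}
counts = count_triples(extraction)
```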

References

- Current implementation: trustgraph-flow/trustgraph/extract/kg/ontology/extract.py
- Prompt template: ontology-prompt.md
- Test cases: tests/unit/test_extract/test_ontology/
- Example ontology: e2e/test-data/food.ontology
