---
layout: default
title: "Ontology Knowledge Extraction - Phase 2 Refactor"
parent: "Chinese (Beta)"
---

# Ontology Knowledge Extraction - Phase 2 Refactor

> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.

**Status**: Draft
**Author**: Analysis Session 2025-12-03
**Related**: `ontology.md`, `ontorag.md`

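## Overview

This document identifies deficiencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.
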
## Current Implementation

### How It Works Now

1. **Ontology Loading** (`ontology_loader.py`)
   - Loads the ontology JSON, whose keys look like `"fo/Recipe"`, `"fo/Food"`, `"fo/produces"`
   - Class IDs carry the namespace prefix in the key itself
   - Example from `food.ontology`:

   ```json
   "classes": {
     "fo/Recipe": {
       "uri": "http://purl.org/ontology/fo/Recipe",
       "rdfs:comment": "A Recipe is a combination..."
     }
   }
   ```

2. **Prompt Construction** (`extract.py:299-307`, `ontology-prompt.md`)
   - The template receives the `classes`, `object_properties`, and `datatype_properties` dicts
   - The template iterates: `{% for class_id, class_def in classes.items() %}`
   - The LLM sees: `**fo/Recipe**: A Recipe is a combination...`
   - The example output format shows:

   ```json
   {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
   {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
   ```

3. **Response Parsing** (`extract.py:382-428`)
   - Expects a JSON array: `[{"subject": "...", "predicate": "...", "object": "..."}]`
   - Validates each triple against the ontology subset
   - Expands URIs via `expand_uri()` (`extract.py:473-521`)

4. **URI Expansion** (`extract.py:473-521`)
   - Checks whether the value is a key in the `ontology_subset.classes` dict
   - If found, extracts the URI from the class definition
   - If not found, constructs a URI: `f"https://trustgraph.ai/ontology/{ontology_id}#{value}"`

### Data Flow Example

**Ontology JSON → Loader → Prompt:**
```
"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"
```

**LLM → Parser → Output:**
```
"Recipe" → not a key in classes (the key is "fo/Recipe") → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI
```

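The two outcomes in the data flow can be reproduced with a small stand-alone sketch. `expand_class_uri` is a hypothetical, simplified stand-in for the class branch of `expand_uri()`; it assumes only the lookup-then-fallback behaviour described in this document, not the real function signature.

```python
from typing import Any

# Fallback pattern quoted from the URI expansion step.
FALLBACK = "https://trustgraph.ai/ontology/{ontology_id}#{value}"


def expand_class_uri(value: str, classes: dict[str, Any], ontology_id: str) -> str:
    """Hypothetical stand-in: exact-key lookup, then the constructed fallback URI."""
    class_def = classes.get(value)
    if isinstance(class_def, dict) and "uri" in class_def:
        return class_def["uri"]
    return FALLBACK.format(ontology_id=ontology_id, value=value)


classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"}}

preserved = expand_class_uri("fo/Recipe", classes, "food")  # key matches
lost = expand_class_uri("Recipe", classes, "food")          # key does not match
print(preserved)
print(lost)
```

The second call silently succeeds with a different URI, which is why the mismatch is easy to miss in practice.
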
## Issues Identified

### 1. **Inconsistent Examples in the Prompt**

**Issue**: The prompt template lists class IDs with prefixes (`fo/Recipe`), but the example output uses unprefixed class names (`Recipe`).

**Location**: `ontology-prompt.md:5-52`

```markdown
## Ontology Classes:
- **fo/Recipe**: A Recipe is...

## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
```

**Impact**: The LLM receives conflicting signals about which format to use.

### 2. **Information Loss in URI Expansion**

**Issue**: When the LLM returns unprefixed class names, as the prompt example does, `expand_uri()` cannot find them in the ontology dict and constructs a fallback URI instead, losing the original, correct URI.

**Location**: `extract.py:494-500`

```python
if value in ontology_subset.classes:  # Looks for "Recipe"
    class_def = ontology_subset.classes[value]  # But key is "fo/Recipe"
    if isinstance(class_def, dict) and 'uri' in class_def:
        return class_def['uri']  # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"  # Fallback
```

**Impact**:
- Original URI: `http://purl.org/ontology/fo/Recipe`
- Constructed URI: `https://trustgraph.ai/ontology/food#Recipe`
- Semantic meaning is lost and interoperability breaks

### 3. **Ambiguous Entity Instance Format**

**Issue**: There is no clear guidance on the URI format for entity instances.

**Examples in prompt**:
- `"recipe:cornish-pasty"` (namespace-like prefix)
- `"ingredient:flour"` (a different prefix)

**Actual behavior** (`extract.py:517-520`):
```python
# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"
```

**Impact**: The LLM has to guess the prefix convention with no ontology context to go on.

### 4. **No Namespace Prefix Guidance**

**Issue**: The ontology JSON contains namespace definitions (lines 10-25 of `food.ontology`):
```json
"namespaces": {
  "fo": "http://purl.org/ontology/fo/",
  "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  ...
}
```

But these are never passed to the LLM. The LLM does not know:
- What "fo" means
- Which prefix to use for entities
- Which namespace applies to which elements

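One way to close this gap is to render the `namespaces` block into a prompt section. The sketch below is illustrative only: `render_namespace_section` and the section heading are assumptions, not part of the existing prompt code.

```python
def render_namespace_section(namespaces: dict[str, str]) -> str:
    """Hypothetical helper: turn the ontology's namespace map into prompt text."""
    lines = ["## Namespace Prefixes:"]
    for prefix, uri in sorted(namespaces.items()):
        lines.append(f"- **{prefix}**: {uri}")
    return "\n".join(lines)


# Only the two namespaces quoted above; the rest of the map is elided in the source.
namespaces = {
    "fo": "http://purl.org/ontology/fo/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
section = render_namespace_section(namespaces)
print(section)
```

Sorting the prefixes keeps the prompt text stable between runs, which helps when comparing extraction results.
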
### 5. **Labels Not Used in the Prompt**

**Issue**: Every class has an `rdfs:label` field (e.g., `{"value": "Recipe", "lang": "en-gb"}`), but the prompt template does not use it.

**Current**: Shows only `class_id` and `comment`
```jinja
- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}
```

**Available but unused**:
```python
"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]
```

**Impact**: Labels could give the LLM a human-readable name alongside each technical ID.

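A minimal sketch of how a label could be pulled from the `rdfs:label` list and placed next to the ID. The helper names `pick_label` and `render_class_line` are hypothetical, and the dict keys follow the ontology JSON shown in this document rather than whatever attribute names the real template receives.

```python
def pick_label(class_id: str, class_def: dict, lang_prefix: str = "en") -> str:
    """Prefer a label in the requested language, then any label, then the ID."""
    labels = class_def.get("rdfs:label") or []
    for label in labels:
        if str(label.get("lang", "")).startswith(lang_prefix) and label.get("value"):
            return label["value"]
    for label in labels:
        if label.get("value"):
            return label["value"]
    return class_id


def render_class_line(class_id: str, class_def: dict) -> str:
    """Render one class entry showing both the technical ID and its label."""
    line = f'- **{class_id}** (label: "{pick_label(class_id, class_def)}")'
    comment = class_def.get("rdfs:comment")
    return f"{line}: {comment}" if comment else line


recipe_def = {
    "rdfs:label": [{"value": "Recipe", "lang": "en-gb"}],
    "rdfs:comment": "A Recipe is a combination...",
}
recipe_line = render_class_line("fo/Recipe", recipe_def)
print(recipe_line)
```

Falling back to the class ID keeps the prompt well-formed for classes that have no label at all.
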
## Proposed Solutions

### Option A: Normalize to Unprefixed IDs

**Approach**: Strip prefixes from class IDs before showing them to the LLM.

**Changes**:
1. Modify `build_extraction_variables()` to transform the keys:
   ```python
   classes_for_prompt = {
       k.split('/')[-1]: v  # "fo/Recipe" → "Recipe"
       for k, v in ontology_subset.classes.items()
   }
   ```

2. Update the prompt example to match (it already uses unprefixed names).

3. Modify `expand_uri()` to handle both formats:
   ```python
   # Try exact match first
   if value in ontology_subset.classes:
       return ontology_subset.classes[value]['uri']

   # Try with prefix
   for prefix in ['fo/', 'rdf:', 'rdfs:']:
       prefixed = f"{prefix}{value}"
       if prefixed in ontology_subset.classes:
           return ontology_subset.classes[prefixed]['uri']
   ```

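**Pros**:
- Cleaner and more human-readable
- Matches the existing prompt examples
- LLMs work better with simpler tokens

**Cons**:
- Class names can collide when multiple ontologies define the same name
- Loses namespace information
- Requires fallback logic for lookups
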
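The collision risk can be made concrete with a short check. This is a sketch only: `find_unprefixed_collisions` is a hypothetical helper, and `cooking/Recipe` is the illustrative second ontology used in the test cases of this document.

```python
from collections import defaultdict


def find_unprefixed_collisions(class_ids: list[str]) -> dict[str, list[str]]:
    """Group class IDs by the name left after stripping the prefix; keep clashes only."""
    by_short_name: dict[str, list[str]] = defaultdict(list)
    for class_id in class_ids:
        by_short_name[class_id.split("/")[-1]].append(class_id)
    return {
        short: sorted(ids)
        for short, ids in by_short_name.items()
        if len(ids) > 1
    }


collisions = find_unprefixed_collisions(["fo/Recipe", "fo/Food", "cooking/Recipe"])
print(collisions)
```

Any non-empty result means Option A would show the LLM a single name for two different classes, with no way to recover which one was meant.
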
### Option B: Use Full Prefixed IDs Consistently

**Approach**: Update the examples to use the prefixed IDs exactly as they appear in the class list.

**Changes**:
1. Update the prompt example (`ontology-prompt.md:46-52`):
   ```json
   [
     {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
     {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
     {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
     {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
   ]
   ```

2. Add a namespace explanation to the prompt:
   ```markdown
   ## Namespace Prefixes:
   - **fo/**: Food Ontology (http://purl.org/ontology/fo/)
   - **rdf:**: RDF (http://www.w3.org/1999/02/22-rdf-syntax-ns#)
   - **rdfs:**: RDF Schema (http://www.w3.org/2000/01/rdf-schema#)

   Use these prefixes exactly as shown when referencing classes and properties.
   ```

3. Keep `expand_uri()` as-is (it works correctly when a match is found).

**Pros**:
- Input and output formats are consistent
- No information loss
- Preserves namespace semantics
- Works with multiple ontologies

**Cons**:
- More verbose tokens for the LLM
- Requires the LLM to track prefixes

### Option C: Hybrid - Show Both Label and ID

**Approach**: Enhance the prompt to show both the human-readable label and the technical ID.

**Changes**:
1. Update the prompt template:
   ```jinja
   {% for class_id, class_def in classes.items() %}
   - **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
   {% endfor %}
   ```

   Example output:
   ```markdown
   - **fo/Recipe** (label: "Recipe"): A Recipe is a combination...
   ```

2. Update the instructions:
   ```markdown
   When referencing classes:
   - Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
   - The label (e.g., "Recipe") is for human understanding only
   ```

**Pros**:
- Clearest for the LLM
- Preserves all information
- Makes explicit which identifier to use

**Cons**:
- Longer prompt
- More complex template

## Implemented Approach

**Simplified Entity-Relationship-Attribute Format**, which completely replaces the old triple-based format.

The new approach was chosen because:

1. **No Information Loss**: Original URIs are preserved correctly
2. **Simpler Logic**: No transformation is needed; direct dict lookups work
3. **Namespace Safety**: Handles multiple ontologies without collisions
4. **Semantic Correctness**: Maintains RDF/OWL semantics

## Implementation Complete

### What Was Built:

1. **New Prompt Template** (`prompts/ontology-extract-v2.txt`)
   - ✅ Clear sections: Entity Types, Relationships, Attributes
   - ✅ Examples using full type identifiers (`fo/Recipe`, `fo/has_ingredient`)
   - ✅ Instructions to use the exact identifiers from the schema
   - ✅ New JSON format with entities/relationships/attributes arrays

2. **Entity Normalization** (`entity_normalizer.py`)
   - ✅ `normalize_entity_name()` - converts names to a URI-safe format
   - ✅ `normalize_type_identifier()` - handles slashes in types (`fo/Recipe` → `fo-recipe`)
   - ✅ `build_entity_uri()` - creates unique URIs from the (name, type) tuple
   - ✅ `EntityRegistry` - tracks entities for deduplication

3. **JSON Parser** (`simplified_parser.py`)
   - ✅ Parses the new format: `{entities: [...], relationships: [...], attributes: [...]}`
   - ✅ Supports kebab-case and snake_case field names
   - ✅ Returns structured dataclasses
   - ✅ Handles errors gracefully, with logging

4. **Triple Converter** (`triple_converter.py`)
   - ✅ `convert_entity()` - generates type + label triples automatically
   - ✅ `convert_relationship()` - connects entity URIs via properties
   - ✅ `convert_attribute()` - adds literal values
   - ✅ Looks up full URIs from the ontology definitions

5. **Updated Main Processor** (`extract.py`)
   - ✅ Removed the old triple-based extraction code
   - ✅ Added the `extract_with_simplified_format()` method
   - ✅ Now uses the new simplified format exclusively
   - ✅ Calls the prompt with the `extract-with-ontologies-v2` ID

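To make the normalization component easier to picture, here is a minimal sketch using the function and class names listed above. The bodies, signatures, and the `get_or_create` method are assumptions for illustration, not the contents of `entity_normalizer.py`; the base URI pattern is the `https://trustgraph.ai/{ontology_id}/...` form quoted earlier in this document.

```python
import re


def normalize_entity_name(name: str) -> str:
    """Lower-case and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")


def normalize_type_identifier(type_id: str) -> str:
    """Make a type ID URI-safe, e.g. "fo/Recipe" becomes "fo-recipe"."""
    return normalize_entity_name(type_id)


def build_entity_uri(name: str, type_id: str, ontology_id: str) -> str:
    """Combine type and name so the same name under two types gets two URIs."""
    slug = f"{normalize_type_identifier(type_id)}-{normalize_entity_name(name)}"
    return f"https://trustgraph.ai/{ontology_id}/{slug}"


class EntityRegistry:
    """Assumed shape: remembers one URI per normalized (name, type) pair."""

    def __init__(self, ontology_id: str) -> None:
        self.ontology_id = ontology_id
        self._uris: dict[tuple[str, str], str] = {}

    def get_or_create(self, name: str, type_id: str) -> str:
        key = (normalize_entity_name(name), normalize_type_identifier(type_id))
        if key not in self._uris:
            self._uris[key] = build_entity_uri(name, type_id, self.ontology_id)
        return self._uris[key]

    def __len__(self) -> int:
        return len(self._uris)


registry = EntityRegistry("food")
first = registry.get_or_create("Cornish pasty", "Recipe")
again = registry.get_or_create("cornish  pasty", "Recipe")  # same entity, messier text
other = registry.get_or_create("Cornish pasty", "Food")     # same name, other type
```

The ASCII-only pattern is a simplification: names written in other scripts would normalize to an empty string, so a real implementation would need a broader rule.
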
## Test Cases

### Test 1: URI Preservation
```python
# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}

# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}

# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"
```

### Test 2: Multi-Ontology Collision
```python
# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}

# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"}  # Not just "Recipe"
```

### Test 3: Entity Instance Format
```python
# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"}  # Namespace-style
{"subject": "food:beef"}  # Consistent prefix
```

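## Open Questions

1. **Should entity instances use namespace prefixes?**
   - Current: `"recipe:cornish-pasty"` (arbitrary)
   - Alternative: use the ontology prefix, `"fo:cornish-pasty"`?
   - Alternative: no prefix, expanding `"cornish-pasty"` → full URI in code?

2. **How should domain/range be shown in the prompt?**
   - Currently shows: `(Recipe → Food)`
   - Should it be: `(fo/Recipe → fo/Food)`?

3. **Should domain/range constraints be validated?**
   - There is a TODO comment at `extract.py:470`
   - Would catch more errors, at the cost of more complexity

4. **What about inverse properties and equivalences?**
   - The ontology has `owl:inverseOf` and `owl:equivalentClass`
   - Not currently used in extraction
   - Should they be?
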
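## Success Metrics

- ✅ Zero URI information loss (100% of original URIs preserved)
- ✅ LLM output format matches the input format
- ✅ No ambiguous examples in the prompt
- ✅ Tests pass with multiple ontologies
- ✅ Improved extraction quality (measured by the percentage of valid triples)
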
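## Alternative Approach: Simplified Extraction Format

### Philosophy

Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it is good at: **finding entities and relationships in text**.

Let the code handle URI construction, RDF conversion, and semantic web formalities.
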
### Example: Entity Classification

**Input text:**
```
Cornish pasty is a traditional British pastry filled with meat and vegetables.
```

**Ontology schema (shown to the LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
```

**What the LLM returns (simple JSON):**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    }
  ]
}
```

**What the code produces (RDF triples):**
```python
# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty"  # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"

# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty

# 2. Generate triples
triples = [
    # Type triple
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),
    # Label triple (automatic)
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
        o=Value(value="Cornish pasty", is_uri=False)
    )
]
```

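### Benefits

1. **The LLM does not need to:**
   - Understand URI syntax
   - Invent identifier prefixes (`recipe:`, `ingredient:`)
   - Know about `rdf:type` or `rdfs:label`
   - Construct semantic web identifiers

2. **The LLM only needs to:**
   - Find entities in the text
   - Map them to ontology classes
   - Extract relationships and attributes

3. **The code handles:**
   - URI normalization and construction
   - RDF triple generation
   - Automatic label assignment
   - Namespace management

### Why This Is Better

- **Simpler prompt** = less confusion = fewer errors
- **Consistent IDs** = the code controls the normalization rules
- **Auto-generated labels** = no missing rdfs:label triples
- **LLM focuses on extraction** = the thing it is actually good at
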
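The division of labour can be sketched in a few lines. `Value` and `Triple` below are local stand-in dataclasses with the same field names used in this document's examples, not the project's own types, and this `convert_entity` is an assumed simplification that takes a plain dict and a class map keyed by the type name shown to the LLM.

```python
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass(frozen=True)
class Value:  # stand-in for the real value type
    value: str
    is_uri: bool


@dataclass(frozen=True)
class Triple:  # stand-in for the real triple type
    s: Value
    p: Value
    o: Value


def convert_entity(entity: dict, classes: dict, ontology_id: str) -> list[Triple]:
    """Emit a type triple and a label triple; skip types the ontology lacks."""
    class_def = classes.get(entity["type"])
    if class_def is None:
        return []
    slug = f'{entity["type"]}-{entity["entity"]}'.lower()
    slug = slug.replace("/", "-").replace(" ", "-")
    subject = Value(f"https://trustgraph.ai/{ontology_id}/{slug}", True)
    return [
        Triple(subject, Value(RDF_TYPE, True), Value(class_def["uri"], True)),
        Triple(subject, Value(RDFS_LABEL, True), Value(entity["entity"], False)),
    ]


classes = {"Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"}}
triples = convert_entity({"entity": "Cornish pasty", "type": "Recipe"}, classes, "food")
```

The LLM supplied only a name and a type; every URI in the result came from code or from the ontology definition.
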
### Example: Entity Relationships

**Input text:**
```
Cornish pasty is a traditional British pastry filled with beef and potatoes.
```

**Ontology schema (shown to the LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)
```

**What the LLM returns (simple JSON):**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ]
}
```

**What the code produces (RDF triples):**
```python
# Normalize (name, type) pairs to entity URIs
cornish_pasty_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
beef_uri = "https://trustgraph.ai/food/food-beef"
potatoes_uri = "https://trustgraph.ai/food/food-potatoes"

# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients"  # from fo/has_ingredient

# Standard predicate URIs used by the type and label triples
rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
rdfs_label = "http://www.w3.org/2000/01/rdf-schema#label"

triples = [
    # Entity type and label triples (as before; Value wrappers omitted for brevity)
    Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
    Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),

    Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=beef_uri, p=rdfs_label, o="beef"),

    Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),

    # Relationship triples
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=beef_uri, is_uri=True)
    ),
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=potatoes_uri, is_uri=True)
    )
]
```

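**Key points:**
- The LLM returns natural-language entity names: `"Cornish pasty"`, `"beef"`, `"potatoes"`
- The LLM includes types to disambiguate: `subject-type`, `object-type`
- The LLM uses relation names from the schema: `"has_ingredient"`
- The code derives consistent IDs from (name, type): `("Cornish pasty", "Recipe")` → `recipe-cornish-pasty`
- The code looks up the relation URI from the ontology: `fo/has_ingredient` → full URI
- The same (name, type) tuple always gets the same URI (deduplication)
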
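These points can be tied together in a short sketch. Everything here is illustrative: `read_field`, `entity_uri`, and this `convert_relationship` signature are assumptions, with the property map keyed by the relation name shown to the LLM. The kebab-case / snake_case tolerance mirrors the parser behaviour listed under "What Was Built".

```python
from typing import Optional


def read_field(record: dict, kebab_name: str) -> Optional[str]:
    """Return record[kebab_name], falling back to its snake_case spelling."""
    if kebab_name in record:
        return record[kebab_name]
    return record.get(kebab_name.replace("-", "_"))


def entity_uri(name: str, type_name: str, ontology_id: str) -> str:
    """Derive a URI from the (name, type) pair, so equal pairs give equal URIs."""
    slug = f"{type_name}-{name}".lower().replace("/", "-").replace(" ", "-")
    return f"https://trustgraph.ai/{ontology_id}/{slug}"


def convert_relationship(rel: dict, object_properties: dict,
                         ontology_id: str) -> Optional[tuple[str, str, str]]:
    """Return (subject URI, predicate URI, object URI), or None if unusable."""
    prop = object_properties.get(rel["relation"])
    subject_type = read_field(rel, "subject-type")
    object_type = read_field(rel, "object-type")
    if prop is None or subject_type is None or object_type is None:
        return None
    return (
        entity_uri(rel["subject"], subject_type, ontology_id),
        prop["uri"],  # looked up from the ontology, never invented by the LLM
        entity_uri(rel["object"], object_type, ontology_id),
    )


object_properties = {"has_ingredient": {"uri": "http://purl.org/ontology/fo/ingredients"}}
relationship = {
    "subject": "Cornish pasty", "subject_type": "Recipe",  # snake_case on purpose
    "relation": "has_ingredient",
    "object": "beef", "object_type": "Food",
}
triple = convert_relationship(relationship, object_properties, "food")
```

Returning `None` for an unknown relation, rather than constructing a fallback URI, avoids repeating the information-loss problem described earlier in this document.
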
### Example: Entity Name Disambiguation

**Problem:** The same name can refer to different entity types.

**Real-world case:**
```
"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)
```

**How it is handled:**

The LLM returns both as separate entities:
```json
{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "Cornish pasty", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "produces",
      "object": "Cornish pasty",
      "object-type": "Food"
    }
  ]
}
```

**Code resolution:**
```python
# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"

food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"

# Relationship connects them correctly
triple = Triple(
    s=recipe_uri,  # The Recipe
    p="http://purl.org/ontology/fo/produces",
    o=food_uri  # The Food
)
```

**Why this works:**
- The type is included in every reference (entities, relationships, attributes)
- The code uses the `(name, type)` tuple as the lookup key
- No ambiguity, no collisions

### Example: Entity Attributes

**Input text:**
```
This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.
```

**Ontology schema (shown to the LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method

## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)
```

**What the LLM returns (simple JSON):**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty recipe",
      "type": "Recipe"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4-6 people"
    },
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "preparation_time",
      "value": "45 minutes"
    }
  ]
}
```

**What the code produces (RDF triples):**
```python
# Normalize the (name, type) pair to an entity URI
recipe_uri = "https://trustgraph.ai/food/recipe-cornish-pasty-recipe"

# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves"  # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time"  # from fo/preparation_time

# Standard predicate URIs used by the type and label triples
rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
rdfs_label = "http://www.w3.org/2000/01/rdf-schema#label"

triples = [
    # Entity type triple
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdf_type, is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),

    # Label triple (automatic)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdfs_label, is_uri=True),
        o=Value(value="Cornish pasty recipe", is_uri=False)
    ),

    # Attribute triples (objects are literals, not URIs)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=serves_uri, is_uri=True),
        o=Value(value="4-6 people", is_uri=False)  # Literal value!
    ),
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=prep_time_uri, is_uri=True),
        o=Value(value="45 minutes", is_uri=False)  # Literal value!
    )
]
```

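**Key points:**
- The LLM extracts literal values: `"4-6 people"`, `"45 minutes"`
- The LLM includes the entity type to disambiguate: `entity-type`
- The LLM uses attribute names from the schema: `"serves"`, `"preparation_time"`
- The code looks up attribute URIs from the ontology's datatype properties
- **The object is a literal** (`is_uri=False`), not a URI reference
- Values stay as natural text; no normalization is needed

**Difference from relationships:**
- Relationships: both subject and object are entities (URIs)
- Attributes: the subject is an entity (URI); the object is a literal value (string/number)
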
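The contrast with relationships shows up in a single flag. The sketch below is an assumption-laden illustration: `Statement` is a local tuple type, and this `convert_attribute` takes a plain dict plus a datatype-property map keyed by the attribute name shown to the LLM.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):  # local stand-in for a triple
    subject: str
    predicate: str
    obj: str
    object_is_uri: bool


def convert_attribute(attr: dict, datatype_properties: dict,
                      ontology_id: str) -> Optional[Statement]:
    """Build an attribute statement whose object stays a plain literal."""
    prop = datatype_properties.get(attr["attribute"])
    if prop is None:
        return None  # attribute not defined in the ontology subset
    slug = f'{attr["entity-type"]}-{attr["entity"]}'.lower()
    slug = slug.replace("/", "-").replace(" ", "-")
    return Statement(
        subject=f"https://trustgraph.ai/{ontology_id}/{slug}",
        predicate=prop["uri"],
        obj=str(attr["value"]),  # kept verbatim, no normalization
        object_is_uri=False,
    )


datatype_properties = {"serves": {"uri": "http://purl.org/ontology/fo/serves"}}
serves = convert_attribute(
    {"entity": "Cornish pasty recipe", "entity-type": "Recipe",
     "attribute": "serves", "value": "4-6 people"},
    datatype_properties, "food",
)
```

Only the subject and predicate go through URI handling; the value `"4-6 people"` passes through untouched.
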
### Complete Example: Entities + Relationships + Attributes

**Input text:**
```
Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.
```

**What the LLM returns:**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4 people"
    }
  ]
}
```

**Result:** 9 RDF triples generated:
- 3 entity type triples (rdf:type)
- 3 entity label triples (rdfs:label), generated automatically
- 2 relationship triples (has_ingredient)
- 1 attribute triple (serves)

All from simple, natural-language extractions by the LLM!

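The per-category tally can be checked mechanically. This sketch assumes one type triple and one label triple per entity, one triple per relationship, and one per attribute, as described in this document; `count_triples` is a hypothetical helper and the payload is a compacted copy of the complete example.

```python
import json


def count_triples(payload: str) -> dict[str, int]:
    """Tally the triples implied by a simplified-format extraction payload."""
    data = json.loads(payload)
    entities = data.get("entities", [])
    counts = {
        "type": len(entities),   # one rdf:type triple per entity
        "label": len(entities),  # one automatic rdfs:label triple per entity
        "relationship": len(data.get("relationships", [])),
        "attribute": len(data.get("attributes", [])),
    }
    counts["total"] = sum(counts.values())
    return counts


payload = json.dumps({
    "entities": [
        {"entity": "Cornish pasty", "type": "Recipe"},
        {"entity": "beef", "type": "Food"},
        {"entity": "potatoes", "type": "Food"},
    ],
    "relationships": [
        {"subject": "Cornish pasty", "subject-type": "Recipe",
         "relation": "has_ingredient", "object": "beef", "object-type": "Food"},
        {"subject": "Cornish pasty", "subject-type": "Recipe",
         "relation": "has_ingredient", "object": "potatoes", "object-type": "Food"},
    ],
    "attributes": [
        {"entity": "Cornish pasty", "entity-type": "Recipe",
         "attribute": "serves", "value": "4 people"},
    ],
})
counts = count_triples(payload)
print(counts)
```

A count like this also makes a cheap regression check: if the number of generated triples differs from the tally, something was dropped or duplicated during conversion.
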
## References

- Current implementation: `trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`
- Prompt template: `ontology-prompt.md`
- Test cases: `tests/unit/test_extract/test_ontology/`
- Example ontology: `e2e/test-data/food.ontology`