---
layout: default
title: "Ontology Knowledge Extraction - Phase 2 Refactor"
parent: "Chinese (Beta)"
---

# Ontology Knowledge Extraction - Phase 2 Refactor

> **Beta Translation:** This document was translated via Machine Learning and as such may not be 100% accurate. All non-English languages are currently classified as Beta.

**Status**: Draft
**Author**: Analysis session 2025-12-03
**Related**: `ontology.md`, `ontorag.md`

## Overview

This document identifies defects in the current ontology-based knowledge extraction system and proposes a refactor that improves LLM performance and reduces information loss.

## Current Implementation

### How It Works Today

1. **Ontology loading** (`ontology_loader.py`)
   - Loads the ontology JSON file, whose keys look like `"fo/Recipe"`, `"fo/Food"`, `"fo/produces"`
   - Class IDs carry the namespace prefix in the key itself
   - Example from `food.ontology`:
   ```json
   "classes": {
     "fo/Recipe": {
       "uri": "http://purl.org/ontology/fo/Recipe",
       "rdfs:comment": "A Recipe is a combination..."
     }
   }
   ```

2. **Prompt construction** (`extract.py:299-307`, `ontology-prompt.md`)
   - The template receives the `classes`, `object_properties`, and `datatype_properties` dictionaries
   - The template loops: `{% for class_id, class_def in classes.items() %}`
   - The LLM sees: `**fo/Recipe**: A Recipe is a combination...`
   - The example output format is:
   ```json
   {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
   {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
   ```

3. **Response parsing** (`extract.py:382-428`)
   - Expects a JSON array: `[{"subject": "...", "predicate": "...", "object": "..."}]`
   - Validates each triple against the ontology subset
   - Expands URIs via `expand_uri()` (`extract.py:473-521`)

4. **URI expansion** (`extract.py:473-521`)
   - Checks whether the value is a key in the `ontology_subset.classes` dictionary
   - If found, takes the URI from the class definition
   - If not found, constructs a URI: `f"https://trustgraph.ai/ontology/{ontology_id}#{value}"`

### Data Flow Example

**Ontology JSON → loader → prompt:**
```
"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"
```

**LLM → parser → output:**
```
"Recipe"    → no such key (the key is "fo/Recipe") → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI
```

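
The two paths in the data flow above can be reproduced with a small stand-in for the lookup. This is a minimal sketch, not the code in `extract.py`: the function name `expand_class_uri` and the `CLASSES` dictionary are hypothetical, while the fallback URI pattern is the one quoted in this document.

```python
# Hypothetical stand-in for the legacy lookup; not the real extract.py code.
CLASSES = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
}


def expand_class_uri(value, classes, ontology_id):
    """Return the ontology URI on an exact key match, else a constructed fallback."""
    class_def = classes.get(value)
    if isinstance(class_def, dict) and "uri" in class_def:
        return class_def["uri"]
    # Fallback pattern quoted in this document
    return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"


preserved = expand_class_uri("fo/Recipe", CLASSES, "food")  # exact key match
lost = expand_class_uri("Recipe", CLASSES, "food")          # bare name, no match
print(preserved)
print(lost)
```

Only the prefixed form reaches the stored URI; the bare name silently falls through to the constructed one.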
## Issues Identified

### 1. **Inconsistent Examples in the Prompt**

**Problem:** The prompt template lists class IDs with a prefix (`fo/Recipe`), but the example output uses bare class names (`Recipe`).

**Location:** `ontology-prompt.md:5-52`

```markdown
## Ontology Classes:
- **fo/Recipe**: A Recipe is...

## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
```

**Impact:** The LLM receives conflicting signals about which format to use.

### 2. **Information Loss in URI Expansion**

**Problem:** When the LLM returns bare class names, as the prompt example does, `expand_uri()` cannot find them in the ontology dictionary and constructs a fallback URI instead, losing the original, correct URI.

**Location:** `extract.py:494-500`

```python
if value in ontology_subset.classes:  # Looks for "Recipe"
    class_def = ontology_subset.classes[value]  # But key is "fo/Recipe"
    if isinstance(class_def, dict) and 'uri' in class_def:
        return class_def['uri']  # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"  # Fallback
```

**Impact:**
- Original URI: `http://purl.org/ontology/fo/Recipe`
- Constructed URI: `https://trustgraph.ai/ontology/food#Recipe`
- Semantic information is lost and interoperability breaks

### 3. **Ambiguous Entity Instance Format**

**Problem:** There is no explicit guidance on the URI format for entity instances.

**Examples in the prompt:**
- `"recipe:cornish-pasty"` (namespace-like prefix)
- `"ingredient:flour"` (a different prefix)

**Actual behavior** (`extract.py:517-520`):
```python
# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"
```

**Impact:** The LLM has to guess a prefix convention, with nothing in the ontology context to guide it.

### 4. **No Guidance on Namespace Prefixes**

**Problem:** The ontology JSON contains namespace definitions (lines 10-25 of `food.ontology`):
```json
"namespaces": {
  "fo": "http://purl.org/ontology/fo/",
  "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  ...
}
```

However, these are never passed to the LLM. The LLM does not know:
- What "fo" means
- Which prefix to use for entities
- Which namespace applies to which elements

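
One way to close this gap would be to render the `namespaces` dictionary into its own prompt section. The sketch below assumes a hypothetical helper, `render_namespace_section`, and a section heading chosen for illustration; neither is taken from the existing prompt template.

```python
def render_namespace_section(namespaces):
    """Render a prefix-to-URI mapping as a Markdown section for the prompt."""
    lines = ["## Namespace Prefixes:"]
    for prefix, uri in sorted(namespaces.items()):
        lines.append(f"- **{prefix}**: {uri}")
    return "\n".join(lines)


# The two namespace entries quoted from food.ontology above
NAMESPACES = {
    "fo": "http://purl.org/ontology/fo/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

section = render_namespace_section(NAMESPACES)
print(section)
```

Sorting the prefixes keeps the rendered prompt stable between runs, which makes prompt changes easier to diff.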
### 5. **Labels Not Used in the Prompt**

**Problem:** Every class has an `rdfs:label` field (for example, `{"value": "Recipe", "lang": "en-gb"}`), but the prompt template does not use it.

**Current:** Only `class_id` and `comment` are shown
```jinja
- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}
```

**Available but unused:**
```python
"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]
```

**Impact:** Human-readable names could be shown alongside the technical IDs, but currently are not.

## Proposed Solutions

### Option A: Standardize on Unprefixed IDs

**Approach:** Strip the prefix from class IDs before showing them to the LLM.

**Changes:**
1. Modify `build_extraction_variables()` to transform the keys:
   ```python
   classes_for_prompt = {
       k.split('/')[-1]: v  # "fo/Recipe" → "Recipe"
       for k, v in ontology_subset.classes.items()
   }
   ```

2. Update the prompt examples to match (they already use unprefixed names).

3. Modify `expand_uri()` to handle both formats:
   ```python
   # Try exact match first
   if value in ontology_subset.classes:
       return ontology_subset.classes[value]['uri']

   # Try with prefix
   for prefix in ['fo/', 'rdf:', 'rdfs:']:
       prefixed = f"{prefix}{value}"
       if prefixed in ontology_subset.classes:
           return ontology_subset.classes[prefixed]['uri']
   ```

**Pros:**
- Cleaner and easier for humans to read
- Matches the existing prompt examples
- LLMs tend to handle simpler tokens better

**Cons:**
- Class names can collide when multiple ontologies define the same name
- Namespace information is lost
- Lookups need fallback logic

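
The collision risk is easy to demonstrate. The sketch below applies the same `split('/')[-1]` transformation to a merged class dictionary; the `cooking/Recipe` entry and its URI are invented for illustration, and `strip_prefixes` is a hypothetical helper rather than part of the codebase.

```python
def strip_prefixes(classes):
    """Strip namespace prefixes from class IDs and report names that collide."""
    stripped = {}
    collisions = []
    for key, class_def in classes.items():
        short = key.split("/")[-1]  # "fo/Recipe" -> "Recipe"
        if short in stripped:
            collisions.append(short)
        stripped[short] = class_def  # a later entry silently overwrites an earlier one
    return stripped, collisions


MERGED = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
    # Hypothetical second ontology, for illustration only
    "cooking/Recipe": {"uri": "http://example.org/cooking/Recipe"},
    "fo/Food": {"uri": "http://purl.org/ontology/fo/Food"},
}

stripped, collisions = strip_prefixes(MERGED)
print(collisions)
print(stripped["Recipe"]["uri"])
```

After stripping, only one `Recipe` definition survives, and which one depends on dictionary insertion order.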
### Option B: Always Use Full Prefixed IDs

**Approach:** Update the examples so they use the same prefixed IDs that appear in the class list.

**Changes:**
1. Update the prompt examples (`ontology-prompt.md:46-52`):
   ```json
   [
     {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
     {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
     {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
     {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
   ]
   ```

2. Add a namespace explanation to the prompt:
   ```markdown
   ## Namespace Prefixes:
   - **fo/**: Food Ontology (http://purl.org/ontology/fo/)
   - **rdf:**: RDF (http://www.w3.org/1999/02/22-rdf-syntax-ns#)
   - **rdfs:**: RDF Schema

   Use these prefixes exactly as shown when referencing classes and properties.
   ```

3. Leave `expand_uri()` as is (it works correctly when a match is found).

**Pros:**
- Input format = output format
- No information loss
- Preserves namespace semantics
- Works with multiple ontologies

**Cons:**
- More verbose tokens for the LLM
- Requires the LLM to keep track of prefixes

### Option C: Hybrid - Show Both Label and ID

**Approach:** Enhance the prompt to show both the human-readable label and the technical ID.

**Changes:**
1. Update the prompt template:
   ```jinja
   {% for class_id, class_def in classes.items() %}
   - **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
   {% endfor %}
   ```

   Example output:
   ```markdown
   - **fo/Recipe** (label: "Recipe"): A Recipe is a combination...
   ```

2. Update the instructions:
   ```markdown
   When referencing classes:
   - Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
   - The label (e.g., "Recipe") is for human understanding only
   ```

**Pros:**
- Clearest for the LLM
- Preserves all information
- States explicitly what to use

**Cons:**
- Longer prompt
- More complex template

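
For readers who want to try the label-plus-ID rendering outside the template engine, here is a plain-Python sketch of the same line format. It reads the raw `rdfs:label` and `rdfs:comment` keys shown earlier in the ontology JSON, which is an assumption: the Jinja template above accesses `labels` and `comment` attributes instead. The helper name `render_class_line` is hypothetical.

```python
def render_class_line(class_id, class_def):
    """Render one class as '- **id** (label: "...")' with an optional comment."""
    labels = class_def.get("rdfs:label") or []
    # Fall back to the technical ID when the class has no label
    label = labels[0]["value"] if labels else class_id
    line = f'- **{class_id}** (label: "{label}")'
    comment = class_def.get("rdfs:comment")
    if comment:
        line += f": {comment}"
    return line


recipe_def = {
    "rdfs:label": [{"value": "Recipe", "lang": "en-gb"}],
    "rdfs:comment": "A Recipe is a combination...",
}

with_label = render_class_line("fo/Recipe", recipe_def)
without_label = render_class_line("fo/Food", {})
print(with_label)
print(without_label)
```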
## Implemented Approach

**Simplified entity-relationship-attribute format** - completely replaces the old triple-based format.

This new approach was chosen because it provides:

1. **No Information Loss**: Original URIs are preserved correctly
2. **Simpler Logic**: No transformation is needed; direct dictionary lookups work
3. **Namespace Safety**: Handles multiple ontologies without collisions
4. **Semantic Correctness**: Maintains RDF/OWL semantics

## Implementation Complete

### What Was Built:

1. **New Prompt Template** (`prompts/ontology-extract-v2.txt`)
   - ✅ Clear sections: Entity Types, Relationships, Attributes
   - ✅ Examples that use full type identifiers (`fo/Recipe`, `fo/has_ingredient`)
   - ✅ Instructions to use the exact identifiers from the schema
   - ✅ New JSON format with entities/relationships/attributes arrays

2. **Entity Normalization** (`entity_normalizer.py`)
   - ✅ `normalize_entity_name()` - converts names to a URI-safe format
   - ✅ `normalize_type_identifier()` - handles slashes in types (`fo/Recipe` → `fo-recipe`)
   - ✅ `build_entity_uri()` - creates unique URIs from (name, type) tuples
   - ✅ `EntityRegistry` - tracks entities for deduplication

3. **JSON Parser** (`simplified_parser.py`)
   - ✅ Parses the new format: `{entities: [...], relationships: [...], attributes: [...]}`
   - ✅ Supports both kebab-case and snake_case field names
   - ✅ Returns structured dataclasses
   - ✅ Graceful error handling with logging

4. **Triple Converter** (`triple_converter.py`)
   - ✅ `convert_entity()` - generates type + label triples automatically
   - ✅ `convert_relationship()` - connects entity URIs via properties
   - ✅ `convert_attribute()` - adds literal values
   - ✅ Looks up full URIs from the ontology definitions

5. **Updated Main Processor** (`extract.py`)
   - ✅ Removed the old triple-based extraction code
   - ✅ Added the `extract_with_simplified_format()` method
   - ✅ Now uses the new simplified format exclusively
   - ✅ Calls the prompt with the `extract-with-ontologies-v2` ID

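
To make the normalization step concrete, the following is a rough sketch of how the four pieces named for `entity_normalizer.py` could fit together. The function and class names come from the list above, but the bodies, signatures, and the `get_or_create` method are assumptions made for illustration, not the actual module. The regular expression is ASCII-only for brevity.

```python
import re


def normalize_entity_name(name):
    """Lower-case and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def normalize_type_identifier(type_id):
    """Make a type ID URI-safe, e.g. "fo/Recipe" -> "fo-recipe"."""
    return normalize_entity_name(type_id)


def build_entity_uri(name, type_id, ontology_id):
    """Combine type and name so that the same name under two types cannot collide."""
    slug = f"{normalize_type_identifier(type_id)}-{normalize_entity_name(name)}"
    return f"https://trustgraph.ai/{ontology_id}/{slug}"


class EntityRegistry:
    """Remember the URI issued for each normalized (name, type) pair."""

    def __init__(self, ontology_id):
        self.ontology_id = ontology_id
        self._uris = {}

    def get_or_create(self, name, type_id):
        key = (normalize_entity_name(name), normalize_type_identifier(type_id))
        if key not in self._uris:
            self._uris[key] = build_entity_uri(name, type_id, self.ontology_id)
        return self._uris[key]


registry = EntityRegistry("food")
recipe_uri = registry.get_or_create("Cornish pasty", "fo/Recipe")
same_uri = registry.get_or_create("cornish  pasty", "fo/Recipe")
food_uri = registry.get_or_create("Cornish pasty", "fo/Food")
print(recipe_uri)
print(food_uri)
```

Keying the registry on the normalized pair means spelling variants of one entity share a URI, while the type component keeps a recipe and a food of the same name apart.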
## Test Cases

### Test 1: URI Preservation
```python
# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}

# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}

# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"
```

### Test 2: Multi-Ontology Collision
```python
# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}

# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"}  # Not just "Recipe"
```

### Test 3: Entity Instance Format
```python
# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"}  # Namespace-style
{"subject": "food:beef"}             # Consistent prefix
```

## Open Questions

1. **Should entity instances use namespace prefixes?**
   - Current: `"recipe:cornish-pasty"` (arbitrary)
   - Alternative: use the ontology prefix, `"fo:cornish-pasty"`?
   - Alternative: use no prefix and expand `"cornish-pasty"` to a full URI in code?

2. **How should domain/range be handled in the prompt?**
   - Currently shown as: `(Recipe → Food)`
   - Should it be: `(fo/Recipe → fo/Food)`?

3. **Should domain/range constraints be validated?**
   - There is a TODO comment at `extract.py:470`
   - Would catch more errors, but adds complexity

4. **What about inverse properties and equivalences?**
   - The ontology has `owl:inverseOf` and `owl:equivalentClass`
   - Neither is currently used in extraction
   - Should they be?

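
If domain/range validation (question 3) were adopted, a first version might look like the sketch below. It is only an illustration: the `domain` and `range` keys, the `check_domain_range` helper, and the exact-match comparison are all assumptions, and a real check would probably also need to accept subclasses.

```python
def check_domain_range(relationship, object_properties):
    """Return a list of constraint violations for one relationship (empty if none)."""
    prop = object_properties.get(relationship["relation"])
    if prop is None:
        return [f"unknown relation: {relationship['relation']}"]
    errors = []
    if relationship["subject-type"] != prop["domain"]:
        errors.append(
            f"domain mismatch: expected {prop['domain']}, got {relationship['subject-type']}"
        )
    if relationship["object-type"] != prop["range"]:
        errors.append(
            f"range mismatch: expected {prop['range']}, got {relationship['object-type']}"
        )
    return errors


# Hypothetical property table: has_ingredient relates a Recipe to an Ingredient
OBJECT_PROPERTIES = {"has_ingredient": {"domain": "Recipe", "range": "Ingredient"}}

ok = {
    "subject": "Cornish pasty", "subject-type": "Recipe",
    "relation": "has_ingredient",
    "object": "beef filling", "object-type": "Ingredient",
}
bad = {**ok, "object-type": "Food"}  # same relationship, wrong object type

print(check_domain_range(ok, OBJECT_PROPERTIES))
print(check_domain_range(bad, OBJECT_PROPERTIES))
```

Returning a list of messages rather than raising lets the caller decide whether to drop the relationship, log it, or keep it with a warning.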
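## Success Metrics

- ✅ Zero URI information loss (100% of original URIs preserved)
- ✅ LLM output format matches the input format
- ✅ No ambiguous examples in the prompt
- ✅ Tests pass with multiple ontologies
- ✅ Improved extraction quality (measured by the percentage of valid triples)
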
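
The last metric needs a definition of "valid" before it can be computed. As one possible reading, the sketch below counts a triple as valid when its predicate is a URI known to the ontology; that definition, the helper name `valid_triple_percentage`, and the sample data are assumptions for illustration.

```python
def valid_triple_percentage(triples, known_predicates):
    """Percentage of (subject, predicate, object) triples whose predicate is known."""
    if not triples:
        return 0.0
    valid = sum(1 for _s, predicate, _o in triples if predicate in known_predicates)
    return 100.0 * valid / len(triples)


KNOWN = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://purl.org/ontology/fo/produces",
}

SAMPLE = [
    ("ex:pasty", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://purl.org/ontology/fo/Recipe"),
    ("ex:pasty", "http://www.w3.org/2000/01/rdf-schema#label", "Cornish pasty"),
    ("ex:pasty", "http://purl.org/ontology/fo/produces", "ex:pasty-food"),
    ("ex:pasty", "https://trustgraph.ai/ontology/food#madeUp", "ex:something"),  # unknown predicate
]

score = valid_triple_percentage(SAMPLE, KNOWN)
print(score)
```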
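## Alternative Approach: Simplified Extraction Format

### Philosophy

Instead of asking the LLM to understand RDF/OWL semantics, let it do what it is good at: **finding entities and relationships in text**.

Let code handle URI construction, RDF conversion, and semantic web conventions.
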
### Example: Entity Classification

**Input Text:**
```
Cornish pasty is a traditional British pastry filled with meat and vegetables.
```

**Ontology Schema (shown to the LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
```

**What the LLM Returns (simple JSON):**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    }
  ]
}
```

**What the Code Produces (RDF triples):**
```python
# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty"  # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"

# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty

# 2. Generate triples
triples = [
    # Type triple
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),
    # Label triple (automatic)
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
        o=Value(value="Cornish pasty", is_uri=False)
    )
]
```

### Advantages

1. **The LLM does not need to:**
   - Understand URI syntax
   - Invent identifier prefixes (`recipe:`, `ingredient:`)
   - Know about `rdf:type` or `rdfs:label`
   - Construct semantic web identifiers

2. **The LLM only needs to:**
   - Find entities in the text
   - Map them to ontology classes
   - Extract relationships and attributes

3. **Code takes care of:**
   - URI normalization and construction
   - RDF triple generation
   - Automatic label assignment
   - Namespace management

### Why This Is Better

- **Simpler prompt** = less confusion = fewer errors
- **Consistent IDs** = code controls the normalization rules
- **Auto-generated labels** = no missing `rdfs:label` triples
- **LLM focuses on extraction** = the thing it is actually good at

### Example: Entity Relationships

**Input Text:**
```
Cornish pasty is a traditional British pastry filled with beef and potatoes.
```

**Ontology Schema (shown to the LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)
```

**What the LLM Returns (simple JSON):**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ]
}
```

**What the Code Produces (RDF triples):**
```python
# Normalize (name, type) pairs to URIs, as in the classification example
cornish_pasty_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
beef_uri = "https://trustgraph.ai/food/food-beef"
potatoes_uri = "https://trustgraph.ai/food/food-potatoes"

# Standard predicate URIs
rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
rdfs_label = "http://www.w3.org/2000/01/rdf-schema#label"

# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients"  # from fo/has_ingredient

triples = [
    # Entity type and label triples (as before; Value wrappers omitted for brevity)
    Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
    Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),

    Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=beef_uri, p=rdfs_label, o="beef"),

    Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),

    # Relationship triples
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=beef_uri, is_uri=True)
    ),
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=potatoes_uri, is_uri=True)
    )
]
```

**Key Points:**
- The LLM returns natural-language entity names: `"Cornish pasty"`, `"beef"`, `"potatoes"`
- The LLM includes type information to disambiguate: `subject-type`, `object-type`
- The LLM uses relation names from the schema: `"has_ingredient"`
- Code derives consistent IDs from (name, type): `("Cornish pasty", "Recipe")` → `recipe-cornish-pasty`
- Code looks up relation URIs in the ontology: `fo/has_ingredient` → full URI
- The same (name, type) tuple always gets the same URI (deduplication)

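
The relationship step described in these points can be sketched with plain tuples standing in for `Triple` objects. This is not the `triple_converter.py` implementation: the signatures, the `OBJECT_PROPERTIES` table, and the decision to return `None` for unknown relations are assumptions, while the URIs are the ones used in this example.

```python
def entity_uri(name, type_id, ontology_id="food"):
    """Derive a URI from the (name, type) pair so the type disambiguates the name."""
    slug = f"{type_id}-{name}".lower().replace("/", "-").replace(" ", "-")
    return f"https://trustgraph.ai/{ontology_id}/{slug}"


def convert_relationship(rel, object_properties):
    """Turn one relationship record into a (subject, predicate, object) URI tuple."""
    prop = object_properties.get(rel["relation"])
    if prop is None:
        return None  # relation is not in the ontology subset; skip it
    return (
        entity_uri(rel["subject"], rel["subject-type"]),
        prop["uri"],
        entity_uri(rel["object"], rel["object-type"]),
    )


# Relation URI as quoted in the example above
OBJECT_PROPERTIES = {
    "has_ingredient": {"uri": "http://purl.org/ontology/fo/ingredients"},
}

relationship = {
    "subject": "Cornish pasty", "subject-type": "Recipe",
    "relation": "has_ingredient",
    "object": "beef", "object-type": "Food",
}

triple = convert_relationship(relationship, OBJECT_PROPERTIES)
print(triple)
```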
### Example: Entity Name Disambiguation

**Problem:** The same name can refer to different entity types.

**Real-world case:**
```
"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)
```

**How It Is Handled:**

The LLM returns both as separate entities:
```json
{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "Cornish pasty", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "produces",
      "object": "Cornish pasty",
      "object-type": "Food"
    }
  ]
}
```

**Code Resolution:**
```python
# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"

food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"

# Relationship connects them correctly
triple = Triple(
    s=recipe_uri,  # The Recipe
    p="http://purl.org/ontology/fo/produces",
    o=food_uri  # The Food
)
```

**Why It Works:**
- The type is included in every reference (entities, relationships, attributes)
- Code uses the `(name, type)` tuple as the lookup key
- No ambiguity, no collisions

### Example: Entity Attributes

**Input Text:**
```
This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.
```

**Ontology Schema (shown to the LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method

## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)
```

**What the LLM Returns (simple JSON):**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty recipe",
      "type": "Recipe"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4-6 people"
    },
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "preparation_time",
      "value": "45 minutes"
    }
  ]
}
```

**What the Code Produces (RDF triples):**
```python
# Normalize (name, type) to a URI: ("Cornish pasty recipe", "Recipe")
recipe_uri = "https://trustgraph.ai/food/recipe-cornish-pasty-recipe"

# Standard predicate URIs
rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
rdfs_label = "http://www.w3.org/2000/01/rdf-schema#label"

# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves"  # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time"  # from fo/preparation_time

triples = [
    # Entity type triple
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdf_type, is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),

    # Label triple (automatic)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdfs_label, is_uri=True),
        o=Value(value="Cornish pasty recipe", is_uri=False)
    ),

    # Attribute triples (objects are literals, not URIs)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=serves_uri, is_uri=True),
        o=Value(value="4-6 people", is_uri=False)  # Literal value!
    ),
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=prep_time_uri, is_uri=True),
        o=Value(value="45 minutes", is_uri=False)  # Literal value!
    )
]
```

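**Key Points:**
- The LLM extracts literal values: `"4-6 people"`, `"45 minutes"`
- The LLM includes the entity type to disambiguate: `entity-type`
- The LLM uses attribute names from the schema: `"serves"`, `"preparation_time"`
- Code looks up attribute URIs in the ontology's datatype properties
- **Objects are literals** (`is_uri=False`), not URI references
- Values stay as natural text; no normalization is applied

**Difference from Relationships:**
- Relationships: both subject and object are entities (URIs)
- Attributes: the subject is an entity (URI), the object is a literal value (string/number)
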
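
Because the LLM may emit either `entity-type` or `entity_type`, the parser has to accept both spellings. A minimal way to do that is to rewrite the keys before building a dataclass, as sketched below. The `Attribute` dataclass and `parse_attribute` function are hypothetical and are not taken from `simplified_parser.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    entity: str
    entity_type: str
    attribute: str
    value: str


def parse_attribute(raw):
    """Build an Attribute from a dict with kebab-case or snake_case keys."""
    fields = {key.replace("-", "_"): value for key, value in raw.items()}
    return Attribute(
        entity=fields["entity"],
        entity_type=fields["entity_type"],
        attribute=fields["attribute"],
        value=str(fields["value"]),  # literals stay as text, even if sent as numbers
    )


kebab = parse_attribute({
    "entity": "Cornish pasty recipe",
    "entity-type": "Recipe",
    "attribute": "serves",
    "value": "4-6 people",
})
snake = parse_attribute({
    "entity": "Cornish pasty recipe",
    "entity_type": "Recipe",
    "attribute": "serves",
    "value": "4-6 people",
})
print(kebab)
```

A missing required field raises `KeyError` here; a production parser would more likely log the problem and skip the record.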
### Complete Example: Entities + Relationships + Attributes

**Input Text:**
```
Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.
```

**What the LLM Returns:**
```json
{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4 people"
    }
  ]
}
```

**Result:** 9 RDF triples are generated:
- 3 entity type triples (rdf:type)
- 3 entity label triples (rdfs:label) - automatic
- 2 relationship triples (has_ingredient)
- 1 attribute triple (serves)

All of this comes from a simple, natural-language extraction by the LLM!

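
The breakdown above follows a simple rule: each entity yields two triples (type and label), and each relationship or attribute yields one. The sketch below applies that rule to the shape of the complete example, giving 3 × 2 + 2 + 1 = 9. The helper name `expected_triple_count` is hypothetical, and the rule assumes every record converts successfully.

```python
def expected_triple_count(extraction):
    """Count the triples an extraction yields: 2 per entity, 1 per relationship or attribute."""
    entities = extraction.get("entities", [])
    relationships = extraction.get("relationships", [])
    attributes = extraction.get("attributes", [])
    return 2 * len(entities) + len(relationships) + len(attributes)


# Same shape as the complete example: 3 entities, 2 relationships, 1 attribute
EXTRACTION = {
    "entities": [
        {"entity": "Cornish pasty", "type": "Recipe"},
        {"entity": "beef", "type": "Food"},
        {"entity": "potatoes", "type": "Food"},
    ],
    "relationships": [
        {"subject": "Cornish pasty", "relation": "has_ingredient", "object": "beef"},
        {"subject": "Cornish pasty", "relation": "has_ingredient", "object": "potatoes"},
    ],
    "attributes": [
        {"entity": "Cornish pasty", "attribute": "serves", "value": "4 people"},
    ],
}

total = expected_triple_count(EXTRACTION)
print(total)
```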
## References

- Current implementation: `trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`
- Prompt template: `ontology-prompt.md`
- Test cases: `tests/unit/test_extract/test_ontology/`
- Example ontology: `e2e/test-data/food.ontology`