mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 16:36:21 +02:00
Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.
767 lines
22 KiB
Markdown
767 lines
22 KiB
Markdown
---
|
|
layout: default
|
|
title: "Ontology Knowledge Extraction - Phase 2 Refactor"
|
|
parent: "Tech Specs"
|
|
---
|
|
|
|
# Ontology Knowledge Extraction - Phase 2 Refactor
|
|
|
|
**Status**: Draft
|
|
**Author**: Analysis Session 2025-12-03
|
|
**Related**: `ontology.md`, `ontorag.md`
|
|
|
|
## Overview
|
|
|
|
This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.
|
|
|
|
## Current Implementation
|
|
|
|
### How It Works Now
|
|
|
|
1. **Ontology Loading** (`ontology_loader.py`)
|
|
- Loads ontology JSON with keys like `"fo/Recipe"`, `"fo/Food"`, `"fo/produces"`
|
|
- Class IDs include namespace prefix in the key itself
|
|
- Example from `food.ontology`:
|
|
```json
|
|
"classes": {
|
|
"fo/Recipe": {
|
|
"uri": "http://purl.org/ontology/fo/Recipe",
|
|
"rdfs:comment": "A Recipe is a combination..."
|
|
}
|
|
}
|
|
```
|
|
|
|
2. **Prompt Construction** (`extract.py:299-307`, `ontology-prompt.md`)
|
|
- Template receives `classes`, `object_properties`, `datatype_properties` dicts
|
|
- Template iterates: `{% for class_id, class_def in classes.items() %}`
|
|
- LLM sees: `**fo/Recipe**: A Recipe is a combination...`
|
|
- Example output format shows:
|
|
```json
|
|
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
|
|
{"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
|
|
```
|
|
|
|
3. **Response Parsing** (`extract.py:382-428`)
|
|
- Expects JSON array: `[{"subject": "...", "predicate": "...", "object": "..."}]`
|
|
- Validates against ontology subset
|
|
- Expands URIs via `expand_uri()` (extract.py:473-521)
|
|
|
|
4. **URI Expansion** (`extract.py:473-521`)
|
|
- Checks if value is in `ontology_subset.classes` dict
|
|
- If found, extracts URI from class definition
|
|
- If not found, constructs URI: `f"https://trustgraph.ai/ontology/{ontology_id}#{value}"`
|
|
|
|
### Data Flow Example
|
|
|
|
**Ontology JSON → Loader → Prompt:**
|
|
```
|
|
"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"
|
|
```
|
|
|
|
**LLM → Parser → Output:**
|
|
```
|
|
"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
|
|
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI
|
|
```
|
|
|
|
## Problems Identified
|
|
|
|
### 1. **Inconsistent Examples in Prompt**
|
|
|
|
**Issue**: The prompt template shows class IDs with prefixes (`fo/Recipe`) but the example output uses unprefixed class names (`Recipe`).
|
|
|
|
**Location**: `ontology-prompt.md:5-52`
|
|
|
|
```markdown
|
|
## Ontology Classes:
|
|
- **fo/Recipe**: A Recipe is...
|
|
|
|
## Example Output:
|
|
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
|
|
```
|
|
|
|
**Impact**: LLM receives conflicting signals about what format to use.
|
|
|
|
### 2. **Information Loss in URI Expansion**
|
|
|
|
**Issue**: When LLM returns unprefixed class names following the example, `expand_uri()` can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs.
|
|
|
|
**Location**: `extract.py:494-500`
|
|
|
|
```python
|
|
if value in ontology_subset.classes: # Looks for "Recipe"
|
|
class_def = ontology_subset.classes[value] # But key is "fo/Recipe"
|
|
if isinstance(class_def, dict) and 'uri' in class_def:
|
|
return class_def['uri'] # Never reached!
|
|
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}" # Fallback
|
|
```
|
|
|
|
**Impact**:
|
|
- Original URI: `http://purl.org/ontology/fo/Recipe`
|
|
- Constructed URI: `https://trustgraph.ai/ontology/food#Recipe`
|
|
- Semantic meaning lost, breaks interoperability
|
|
|
|
### 3. **Ambiguous Entity Instance Format**
|
|
|
|
**Issue**: No clear guidance on entity instance URI format.
|
|
|
|
**Examples in prompt**:
|
|
- `"recipe:cornish-pasty"` (namespace-like prefix)
|
|
- `"ingredient:flour"` (different prefix)
|
|
|
|
**Actual behavior** (extract.py:517-520):
|
|
```python
|
|
# Treat as entity instance - construct unique URI
|
|
normalized = value.replace(" ", "-").lower()
|
|
return f"https://trustgraph.ai/{ontology_id}/{normalized}"
|
|
```
|
|
|
|
**Impact**: LLM must guess prefixing convention with no ontology context.
|
|
|
|
### 4. **No Namespace Prefix Guidance**
|
|
|
|
**Issue**: The ontology JSON contains namespace definitions (line 10-25 in food.ontology):
|
|
```json
|
|
"namespaces": {
|
|
"fo": "http://purl.org/ontology/fo/",
|
|
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
|
|
...
|
|
}
|
|
```
|
|
|
|
But these are never surfaced to the LLM. The LLM doesn't know:
|
|
- What "fo" means
|
|
- What prefix to use for entities
|
|
- Which namespace applies to which elements
|
|
|
|
### 5. **Labels Not Used in Prompt**
|
|
|
|
**Issue**: Every class has `rdfs:label` fields (e.g., `{"value": "Recipe", "lang": "en-gb"}`), but the prompt template doesn't use them.
|
|
|
|
**Current**: Shows only `class_id` and `comment`
|
|
```jinja
|
|
- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}
|
|
```
|
|
|
|
**Available but unused**:
|
|
```python
|
|
"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]
|
|
```
|
|
|
|
**Impact**: Could provide human-readable names alongside technical IDs.
|
|
|
|
## Proposed Solutions
|
|
|
|
### Option A: Normalize to Unprefixed IDs
|
|
|
|
**Approach**: Strip prefixes from class IDs before showing to LLM.
|
|
|
|
**Changes**:
|
|
1. Modify `build_extraction_variables()` to transform keys:
|
|
```python
|
|
classes_for_prompt = {
|
|
k.split('/')[-1]: v # "fo/Recipe" → "Recipe"
|
|
for k, v in ontology_subset.classes.items()
|
|
}
|
|
```
|
|
|
|
2. Update prompt example to match (already uses unprefixed names)
|
|
|
|
3. Modify `expand_uri()` to handle both formats:
|
|
```python
|
|
# Try exact match first
|
|
if value in ontology_subset.classes:
|
|
return ontology_subset.classes[value]['uri']
|
|
|
|
# Try with prefix
|
|
for prefix in ['fo/', 'rdf:', 'rdfs:']:
|
|
prefixed = f"{prefix}{value}"
|
|
if prefixed in ontology_subset.classes:
|
|
return ontology_subset.classes[prefixed]['uri']
|
|
```
|
|
|
|
**Pros**:
|
|
- Cleaner, more human-readable
|
|
- Matches existing prompt examples
|
|
- LLMs work better with simpler tokens
|
|
|
|
**Cons**:
|
|
- Class name collisions if multiple ontologies have same class name
|
|
- Loses namespace information
|
|
- Requires fallback logic for lookups
|
|
|
|
### Option B: Use Full Prefixed IDs Consistently
|
|
|
|
**Approach**: Update examples to use prefixed IDs matching what's shown in the class list.
|
|
|
|
**Changes**:
|
|
1. Update prompt example (ontology-prompt.md:46-52):
|
|
```json
|
|
[
|
|
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
|
|
{"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
|
|
{"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
|
|
{"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
|
|
]
|
|
```
|
|
|
|
2. Add namespace explanation to prompt:
|
|
```markdown
|
|
## Namespace Prefixes:
|
|
- **fo/**: Food Ontology (http://purl.org/ontology/fo/)
|
|
- **rdf:**: RDF Schema
|
|
- **rdfs:**: RDF Schema
|
|
|
|
Use these prefixes exactly as shown when referencing classes and properties.
|
|
```
|
|
|
|
3. Keep `expand_uri()` as-is (works correctly when matches found)
|
|
|
|
**Pros**:
|
|
- Input = Output consistency
|
|
- No information loss
|
|
- Preserves namespace semantics
|
|
- Works with multiple ontologies
|
|
|
|
**Cons**:
|
|
- More verbose tokens for LLM
|
|
- Requires LLM to track prefixes
|
|
|
|
### Option C: Hybrid - Show Both Label and ID
|
|
|
|
**Approach**: Enhance prompt to show both human-readable labels and technical IDs.
|
|
|
|
**Changes**:
|
|
1. Update prompt template:
|
|
```jinja
|
|
{% for class_id, class_def in classes.items() %}
|
|
- **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
|
|
{% endfor %}
|
|
```
|
|
|
|
Example output:
|
|
```markdown
|
|
- **fo/Recipe** (label: "Recipe"): A Recipe is a combination...
|
|
```
|
|
|
|
2. Update instructions:
|
|
```markdown
|
|
When referencing classes:
|
|
- Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
|
|
- The label (e.g., "Recipe") is for human understanding only
|
|
```
|
|
|
|
**Pros**:
|
|
- Clearest for LLM
|
|
- Preserves all information
|
|
- Explicit about what to use
|
|
|
|
**Cons**:
|
|
- Longer prompt
|
|
- More complex template
|
|
|
|
## Implemented Approach
|
|
|
|
**Simplified Entity-Relationship-Attribute Format** - completely replaces the old triple-based format.
|
|
|
|
The new approach was chosen because:
|
|
|
|
1. **No Information Loss**: Original URIs preserved correctly
|
|
2. **Simpler Logic**: No transformation needed, direct dict lookups work
|
|
3. **Namespace Safety**: Handles multiple ontologies without collisions
|
|
4. **Semantic Correctness**: Maintains RDF/OWL semantics
|
|
|
|
## Implementation Complete
|
|
|
|
### What Was Built:
|
|
|
|
1. **New Prompt Template** (`prompts/ontology-extract-v2.txt`)
|
|
- ✅ Clear sections: Entity Types, Relationships, Attributes
|
|
- ✅ Example using full type identifiers (`fo/Recipe`, `fo/has_ingredient`)
|
|
- ✅ Instructions to use exact identifiers from schema
|
|
- ✅ New JSON format with entities/relationships/attributes arrays
|
|
|
|
2. **Entity Normalization** (`entity_normalizer.py`)
|
|
- ✅ `normalize_entity_name()` - Converts names to URI-safe format
|
|
- ✅ `normalize_type_identifier()` - Handles slashes in types (`fo/Recipe` → `fo-recipe`)
|
|
- ✅ `build_entity_uri()` - Creates unique URIs using (name, type) tuple
|
|
- ✅ `EntityRegistry` - Tracks entities for deduplication
|
|
|
|
3. **JSON Parser** (`simplified_parser.py`)
|
|
- ✅ Parses new format: `{entities: [...], relationships: [...], attributes: [...]}`
|
|
- ✅ Supports kebab-case and snake_case field names
|
|
- ✅ Returns structured dataclasses
|
|
- ✅ Graceful error handling with logging
|
|
|
|
4. **Triple Converter** (`triple_converter.py`)
|
|
- ✅ `convert_entity()` - Generates type + label triples automatically
|
|
- ✅ `convert_relationship()` - Connects entity URIs via properties
|
|
- ✅ `convert_attribute()` - Adds literal values
|
|
- ✅ Looks up full URIs from ontology definitions
|
|
|
|
5. **Updated Main Processor** (`extract.py`)
|
|
- ✅ Removed old triple-based extraction code
|
|
- ✅ Added `extract_with_simplified_format()` method
|
|
- ✅ Now exclusively uses new simplified format
|
|
- ✅ Calls prompt with `extract-with-ontologies-v2` ID
|
|
|
|
## Test Cases
|
|
|
|
### Test 1: URI Preservation
|
|
```python
|
|
# Given ontology class
|
|
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}
|
|
|
|
# When LLM returns
|
|
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}
|
|
|
|
# Then expanded URI should be
|
|
assert expanded == "http://purl.org/ontology/fo/Recipe"
|
|
# Not: "https://trustgraph.ai/ontology/food#Recipe"
|
|
```
|
|
|
|
### Test 2: Multi-Ontology Collision
|
|
```python
|
|
# Given two ontologies
|
|
ont1 = {"fo/Recipe": {...}}
|
|
ont2 = {"cooking/Recipe": {...}}
|
|
|
|
# LLM should use full prefix to disambiguate
|
|
llm_output = {"object": "fo/Recipe"} # Not just "Recipe"
|
|
```
|
|
|
|
### Test 3: Entity Instance Format
|
|
```python
|
|
# Given prompt with food ontology
|
|
# LLM should create instances like
|
|
{"subject": "recipe:cornish-pasty"} # Namespace-style
|
|
{"subject": "food:beef"} # Consistent prefix
|
|
```
|
|
|
|
## Open Questions
|
|
|
|
1. **Should entity instances use namespace prefixes?**
|
|
- Current: `"recipe:cornish-pasty"` (arbitrary)
|
|
- Alternative: Use ontology prefix `"fo:cornish-pasty"`?
|
|
- Alternative: No prefix, expand in URI `"cornish-pasty"` → full URI?
|
|
|
|
2. **How to handle domain/range in prompt?**
|
|
- Currently shows: `(Recipe → Food)`
|
|
- Should it be: `(fo/Recipe → fo/Food)`?
|
|
|
|
3. **Should we validate domain/range constraints?**
|
|
- TODO comment at extract.py:470
|
|
- Would catch more errors but more complex
|
|
|
|
4. **What about inverse properties and equivalences?**
|
|
- Ontology has `owl:inverseOf`, `owl:equivalentClass`
|
|
- Not currently used in extraction
|
|
- Should they be?
|
|
|
|
## Success Metrics
|
|
|
|
- ✅ Zero URI information loss (100% preservation of original URIs)
|
|
- ✅ LLM output format matches input format
|
|
- ✅ No ambiguous examples in prompt
|
|
- ✅ Tests pass with multiple ontologies
|
|
- ✅ Improved extraction quality (measured by valid triple %)
|
|
|
|
## Alternative Approach: Simplified Extraction Format
|
|
|
|
### Philosophy
|
|
|
|
Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: **find entities and relationships in text**.
|
|
|
|
Let the code handle URI construction, RDF conversion, and semantic web formalities.
|
|
|
|
### Example: Entity Classification
|
|
|
|
**Input Text:**
|
|
```
|
|
Cornish pasty is a traditional British pastry filled with meat and vegetables.
|
|
```
|
|
|
|
**Ontology Schema (shown to LLM):**
|
|
```markdown
|
|
## Entity Types:
|
|
- Recipe: A recipe is a combination of ingredients and a method
|
|
- Food: A food is something that can be eaten
|
|
- Ingredient: An ingredient combines a quantity and a food
|
|
```
|
|
|
|
**What LLM Returns (Simple JSON):**
|
|
```json
|
|
{
|
|
"entities": [
|
|
{
|
|
"entity": "Cornish pasty",
|
|
"type": "Recipe"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
**What Code Produces (RDF Triples):**
|
|
```python
|
|
# 1. Normalize entity name + type to ID (type prevents collisions)
|
|
entity_id = "recipe-cornish-pasty" # normalize("Cornish pasty", "Recipe")
|
|
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
|
|
|
|
# Note: Same name, different type = different URI
|
|
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
|
|
# "Cornish pasty" (Food) → food-cornish-pasty
|
|
|
|
# 2. Generate triples
|
|
triples = [
|
|
# Type triple
|
|
Triple(
|
|
s=Value(value=entity_uri, is_uri=True),
|
|
p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
|
|
o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
|
|
),
|
|
# Label triple (automatic)
|
|
Triple(
|
|
s=Value(value=entity_uri, is_uri=True),
|
|
p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
|
|
o=Value(value="Cornish pasty", is_uri=False)
|
|
)
|
|
]
|
|
```
|
|
|
|
### Benefits
|
|
|
|
1. **LLM doesn't need to:**
|
|
- Understand URI syntax
|
|
- Invent identifier prefixes (`recipe:`, `ingredient:`)
|
|
- Know about `rdf:type` or `rdfs:label`
|
|
- Construct semantic web identifiers
|
|
|
|
2. **LLM just needs to:**
|
|
- Find entities in text
|
|
- Map them to ontology classes
|
|
- Extract relationships and attributes
|
|
|
|
3. **Code handles:**
|
|
- URI normalization and construction
|
|
- RDF triple generation
|
|
- Automatic label assignment
|
|
- Namespace management
|
|
|
|
### Why This Works Better
|
|
|
|
- **Simpler prompt** = less confusion = fewer errors
|
|
- **Consistent IDs** = code controls normalization rules
|
|
- **Auto-generated labels** = no missing rdfs:label triples
|
|
- **LLM focuses on extraction** = what it's actually good at
|
|
|
|
### Example: Entity Relationships
|
|
|
|
**Input Text:**
|
|
```
|
|
Cornish pasty is a traditional British pastry filled with beef and potatoes.
|
|
```
|
|
|
|
**Ontology Schema (shown to LLM):**
|
|
```markdown
|
|
## Entity Types:
|
|
- Recipe: A recipe is a combination of ingredients and a method
|
|
- Food: A food is something that can be eaten
|
|
- Ingredient: An ingredient combines a quantity and a food
|
|
|
|
## Relationships:
|
|
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
|
|
- food: Relates an ingredient to the food that is required (Ingredient → Food)
|
|
```
|
|
|
|
**What LLM Returns (Simple JSON):**
|
|
```json
|
|
{
|
|
"entities": [
|
|
{
|
|
"entity": "Cornish pasty",
|
|
"type": "Recipe"
|
|
},
|
|
{
|
|
"entity": "beef",
|
|
"type": "Food"
|
|
},
|
|
{
|
|
"entity": "potatoes",
|
|
"type": "Food"
|
|
}
|
|
],
|
|
"relationships": [
|
|
{
|
|
"subject": "Cornish pasty",
|
|
"subject-type": "Recipe",
|
|
"relation": "has_ingredient",
|
|
"object": "beef",
|
|
"object-type": "Food"
|
|
},
|
|
{
|
|
"subject": "Cornish pasty",
|
|
"subject-type": "Recipe",
|
|
"relation": "has_ingredient",
|
|
"object": "potatoes",
|
|
"object-type": "Food"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
**What Code Produces (RDF Triples):**
|
|
```python
|
|
# Normalize entity names to URIs
|
|
cornish_pasty_uri = "https://trustgraph.ai/food/cornish-pasty"
|
|
beef_uri = "https://trustgraph.ai/food/beef"
|
|
potatoes_uri = "https://trustgraph.ai/food/potatoes"
|
|
|
|
# Look up relation URI from ontology
|
|
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients" # from fo/has_ingredient
|
|
|
|
triples = [
|
|
# Entity type triples (as before)
|
|
Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
|
|
Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),
|
|
|
|
Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
|
|
Triple(s=beef_uri, p=rdfs_label, o="beef"),
|
|
|
|
Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
|
|
Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),
|
|
|
|
# Relationship triples
|
|
Triple(
|
|
s=Value(value=cornish_pasty_uri, is_uri=True),
|
|
p=Value(value=has_ingredient_uri, is_uri=True),
|
|
o=Value(value=beef_uri, is_uri=True)
|
|
),
|
|
Triple(
|
|
s=Value(value=cornish_pasty_uri, is_uri=True),
|
|
p=Value(value=has_ingredient_uri, is_uri=True),
|
|
o=Value(value=potatoes_uri, is_uri=True)
|
|
)
|
|
]
|
|
```
|
|
|
|
**Key Points:**
|
|
- LLM returns natural language entity names: `"Cornish pasty"`, `"beef"`, `"potatoes"`
|
|
- LLM includes types to disambiguate: `subject-type`, `object-type`
|
|
- LLM uses relation name from schema: `"has_ingredient"`
|
|
- Code derives consistent IDs using (name, type): `("Cornish pasty", "Recipe")` → `recipe-cornish-pasty`
|
|
- Code looks up relation URI from ontology: `fo/has_ingredient` → full URI
|
|
- Same (name, type) tuple always gets same URI (deduplication)
|
|
|
|
### Example: Entity Name Disambiguation
|
|
|
|
**Problem:** Same name can refer to different entity types.
|
|
|
|
**Real-world case:**
|
|
```
|
|
"Cornish pasty" can be:
|
|
- A Recipe (instructions for making it)
|
|
- A Food (the dish itself)
|
|
```
|
|
|
|
**How It's Handled:**
|
|
|
|
LLM returns both as separate entities:
|
|
```json
|
|
{
|
|
"entities": [
|
|
{"entity": "Cornish pasty", "type": "Recipe"},
|
|
{"entity": "Cornish pasty", "type": "Food"}
|
|
],
|
|
"relationships": [
|
|
{
|
|
"subject": "Cornish pasty",
|
|
"subject-type": "Recipe",
|
|
"relation": "produces",
|
|
"object": "Cornish pasty",
|
|
"object-type": "Food"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
**Code Resolution:**
|
|
```python
|
|
# Different types → different URIs
|
|
recipe_uri = normalize("Cornish pasty", "Recipe")
|
|
# → "https://trustgraph.ai/food/recipe-cornish-pasty"
|
|
|
|
food_uri = normalize("Cornish pasty", "Food")
|
|
# → "https://trustgraph.ai/food/food-cornish-pasty"
|
|
|
|
# Relationship connects them correctly
|
|
triple = Triple(
|
|
s=recipe_uri, # The Recipe
|
|
p="http://purl.org/ontology/fo/produces",
|
|
o=food_uri # The Food
|
|
)
|
|
```
|
|
|
|
**Why This Works:**
|
|
- Type is included in ALL references (entities, relationships, attributes)
|
|
- Code uses `(name, type)` tuple as lookup key
|
|
- No ambiguity, no collisions
|
|
|
|
### Example: Entity Attributes
|
|
|
|
**Input Text:**
|
|
```
|
|
This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.
|
|
```
|
|
|
|
**Ontology Schema (shown to LLM):**
|
|
```markdown
|
|
## Entity Types:
|
|
- Recipe: A recipe is a combination of ingredients and a method
|
|
|
|
## Attributes:
|
|
- serves: Indicates what the recipe is intended to serve (Recipe → text)
|
|
- preparation_time: Time needed to prepare the recipe (Recipe → text)
|
|
```
|
|
|
|
**What LLM Returns (Simple JSON):**
|
|
```json
|
|
{
|
|
"entities": [
|
|
{
|
|
"entity": "Cornish pasty recipe",
|
|
"type": "Recipe"
|
|
}
|
|
],
|
|
"attributes": [
|
|
{
|
|
"entity": "Cornish pasty recipe",
|
|
"entity-type": "Recipe",
|
|
"attribute": "serves",
|
|
"value": "4-6 people"
|
|
},
|
|
{
|
|
"entity": "Cornish pasty recipe",
|
|
"entity-type": "Recipe",
|
|
"attribute": "preparation_time",
|
|
"value": "45 minutes"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
**What Code Produces (RDF Triples):**
|
|
```python
|
|
# Normalize entity name to URI
|
|
recipe_uri = "https://trustgraph.ai/food/cornish-pasty-recipe"
|
|
|
|
# Look up attribute URIs from ontology
|
|
serves_uri = "http://purl.org/ontology/fo/serves" # from fo/serves
|
|
prep_time_uri = "http://purl.org/ontology/fo/preparation_time" # from fo/preparation_time
|
|
|
|
triples = [
|
|
# Entity type triple
|
|
Triple(
|
|
s=Value(value=recipe_uri, is_uri=True),
|
|
p=Value(value=rdf_type, is_uri=True),
|
|
o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
|
|
),
|
|
|
|
# Label triple (automatic)
|
|
Triple(
|
|
s=Value(value=recipe_uri, is_uri=True),
|
|
p=Value(value=rdfs_label, is_uri=True),
|
|
o=Value(value="Cornish pasty recipe", is_uri=False)
|
|
),
|
|
|
|
# Attribute triples (objects are literals, not URIs)
|
|
Triple(
|
|
s=Value(value=recipe_uri, is_uri=True),
|
|
p=Value(value=serves_uri, is_uri=True),
|
|
o=Value(value="4-6 people", is_uri=False) # Literal value!
|
|
),
|
|
Triple(
|
|
s=Value(value=recipe_uri, is_uri=True),
|
|
p=Value(value=prep_time_uri, is_uri=True),
|
|
o=Value(value="45 minutes", is_uri=False) # Literal value!
|
|
)
|
|
]
|
|
```
|
|
|
|
**Key Points:**
|
|
- LLM extracts literal values: `"4-6 people"`, `"45 minutes"`
|
|
- LLM includes entity type for disambiguation: `entity-type`
|
|
- LLM uses attribute name from schema: `"serves"`, `"preparation_time"`
|
|
- Code looks up attribute URI from ontology datatype properties
|
|
- **Object is literal** (`is_uri=False`), not a URI reference
|
|
- Values stay as natural text, no normalization needed
|
|
|
|
**Difference from Relationships:**
|
|
- Relationships: both subject and object are entities (URIs)
|
|
- Attributes: subject is entity (URI), object is literal value (string/number)
|
|
|
|
### Complete Example: Entities + Relationships + Attributes
|
|
|
|
**Input Text:**
|
|
```
|
|
Cornish pasty is a savory pastry filled with beef and potatoes.
|
|
This recipe serves 4 people.
|
|
```
|
|
|
|
**What LLM Returns:**
|
|
```json
|
|
{
|
|
"entities": [
|
|
{
|
|
"entity": "Cornish pasty",
|
|
"type": "Recipe"
|
|
},
|
|
{
|
|
"entity": "beef",
|
|
"type": "Food"
|
|
},
|
|
{
|
|
"entity": "potatoes",
|
|
"type": "Food"
|
|
}
|
|
],
|
|
"relationships": [
|
|
{
|
|
"subject": "Cornish pasty",
|
|
"subject-type": "Recipe",
|
|
"relation": "has_ingredient",
|
|
"object": "beef",
|
|
"object-type": "Food"
|
|
},
|
|
{
|
|
"subject": "Cornish pasty",
|
|
"subject-type": "Recipe",
|
|
"relation": "has_ingredient",
|
|
"object": "potatoes",
|
|
"object-type": "Food"
|
|
}
|
|
],
|
|
"attributes": [
|
|
{
|
|
"entity": "Cornish pasty",
|
|
"entity-type": "Recipe",
|
|
"attribute": "serves",
|
|
"value": "4 people"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
**Result:** 11 RDF triples generated:
|
|
- 3 entity type triples (rdf:type)
|
|
- 3 entity label triples (rdfs:label) - automatic
|
|
- 2 relationship triples (has_ingredient)
|
|
- 1 attribute triple (serves)
|
|
|
|
All from simple, natural language extractions by the LLM!
|
|
|
|
## References
|
|
|
|
- Current implementation: `trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`
|
|
- Prompt template: `ontology-prompt.md`
|
|
- Test cases: `tests/unit/test_extract/test_ontology/`
|
|
- Example ontology: `e2e/test-data/food.ontology`
|