Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.
22 KiB
| layout | title | parent |
|---|---|---|
| default | Ontology Knowledge Extraction - Phase 2 Refactor | Tech Specs |
Ontology Knowledge Extraction - Phase 2 Refactor
Status: Draft
Author: Analysis Session 2025-12-03
Related: ontology.md, ontorag.md
Overview
This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.
Current Implementation
How It Works Now
-
Ontology Loading (
ontology_loader.py)- Loads ontology JSON with keys like
"fo/Recipe","fo/Food","fo/produces" - Class IDs include namespace prefix in the key itself
- Example from
food.ontology:"classes": { "fo/Recipe": { "uri": "http://purl.org/ontology/fo/Recipe", "rdfs:comment": "A Recipe is a combination..." } }
- Loads ontology JSON with keys like
-
Prompt Construction (
extract.py:299-307,ontology-prompt.md)- Template receives
classes,object_properties,datatype_propertiesdicts - Template iterates:
{% for class_id, class_def in classes.items() %} - LLM sees:
**fo/Recipe**: A Recipe is a combination... - Example output format shows:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"} {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
- Template receives
-
Response Parsing (
extract.py:382-428)- Expects JSON array:
[{"subject": "...", "predicate": "...", "object": "..."}] - Validates against ontology subset
- Expands URIs via
expand_uri()(extract.py:473-521)
- Expects JSON array:
-
URI Expansion (
extract.py:473-521)- Checks if value is in
ontology_subset.classesdict - If found, extracts URI from class definition
- If not found, constructs URI:
f"https://trustgraph.ai/ontology/{ontology_id}#{value}"
- Checks if value is in
Data Flow Example
Ontology JSON → Loader → Prompt:
"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"
LLM → Parser → Output:
"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI
Problems Identified
1. Inconsistent Examples in Prompt
Issue: The prompt template shows class IDs with prefixes (fo/Recipe) but the example output uses unprefixed class names (Recipe).
Location: ontology-prompt.md:5-52
## Ontology Classes:
- **fo/Recipe**: A Recipe is...
## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
Impact: LLM receives conflicting signals about what format to use.
2. Information Loss in URI Expansion
Issue: When LLM returns unprefixed class names following the example, expand_uri() can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs.
Location: extract.py:494-500
if value in ontology_subset.classes: # Looks for "Recipe"
class_def = ontology_subset.classes[value] # But key is "fo/Recipe"
if isinstance(class_def, dict) and 'uri' in class_def:
return class_def['uri'] # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}" # Fallback
Impact:
- Original URI:
http://purl.org/ontology/fo/Recipe - Constructed URI:
https://trustgraph.ai/ontology/food#Recipe - Semantic meaning lost, breaks interoperability
3. Ambiguous Entity Instance Format
Issue: No clear guidance on entity instance URI format.
Examples in prompt:
"recipe:cornish-pasty"(namespace-like prefix)"ingredient:flour"(different prefix)
Actual behavior (extract.py:517-520):
# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"
Impact: LLM must guess prefixing convention with no ontology context.
4. No Namespace Prefix Guidance
Issue: The ontology JSON contains namespace definitions (line 10-25 in food.ontology):
"namespaces": {
"fo": "http://purl.org/ontology/fo/",
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
...
}
But these are never surfaced to the LLM. The LLM doesn't know:
- What "fo" means
- What prefix to use for entities
- Which namespace applies to which elements
5. Labels Not Used in Prompt
Issue: Every class has rdfs:label fields (e.g., {"value": "Recipe", "lang": "en-gb"}), but the prompt template doesn't use them.
Current: Shows only class_id and comment
- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}
Available but unused:
"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]
Impact: Could provide human-readable names alongside technical IDs.
Proposed Solutions
Option A: Normalize to Unprefixed IDs
Approach: Strip prefixes from class IDs before showing to LLM.
Changes:
-
Modify
build_extraction_variables()to transform keys:classes_for_prompt = { k.split('/')[-1]: v # "fo/Recipe" → "Recipe" for k, v in ontology_subset.classes.items() } -
Update prompt example to match (already uses unprefixed names)
-
Modify
expand_uri()to handle both formats:# Try exact match first if value in ontology_subset.classes: return ontology_subset.classes[value]['uri'] # Try with prefix for prefix in ['fo/', 'rdf:', 'rdfs:']: prefixed = f"{prefix}{value}" if prefixed in ontology_subset.classes: return ontology_subset.classes[prefixed]['uri']
Pros:
- Cleaner, more human-readable
- Matches existing prompt examples
- LLMs work better with simpler tokens
Cons:
- Class name collisions if multiple ontologies have same class name
- Loses namespace information
- Requires fallback logic for lookups
Option B: Use Full Prefixed IDs Consistently
Approach: Update examples to use prefixed IDs matching what's shown in the class list.
Changes:
-
Update prompt example (ontology-prompt.md:46-52):
[ {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"}, {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"}, {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"}, {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"} ] -
Add namespace explanation to prompt:
## Namespace Prefixes: - **fo/**: Food Ontology (http://purl.org/ontology/fo/) - **rdf:**: RDF Schema - **rdfs:**: RDF Schema Use these prefixes exactly as shown when referencing classes and properties. -
Keep
expand_uri()as-is (works correctly when matches found)
Pros:
- Input = Output consistency
- No information loss
- Preserves namespace semantics
- Works with multiple ontologies
Cons:
- More verbose tokens for LLM
- Requires LLM to track prefixes
Option C: Hybrid - Show Both Label and ID
Approach: Enhance prompt to show both human-readable labels and technical IDs.
Changes:
-
Update prompt template:
{% for class_id, class_def in classes.items() %} - **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %} {% endfor %}Example output:
- **fo/Recipe** (label: "Recipe"): A Recipe is a combination... -
Update instructions:
When referencing classes: - Use the full prefixed ID (e.g., "fo/Recipe") in JSON output - The label (e.g., "Recipe") is for human understanding only
Pros:
- Clearest for LLM
- Preserves all information
- Explicit about what to use
Cons:
- Longer prompt
- More complex template
Implemented Approach
Simplified Entity-Relationship-Attribute Format - completely replaces the old triple-based format.
The new approach was chosen because:
- No Information Loss: Original URIs preserved correctly
- Simpler Logic: No transformation needed, direct dict lookups work
- Namespace Safety: Handles multiple ontologies without collisions
- Semantic Correctness: Maintains RDF/OWL semantics
Implementation Complete
What Was Built:
-
New Prompt Template (
prompts/ontology-extract-v2.txt)- ✅ Clear sections: Entity Types, Relationships, Attributes
- ✅ Example using full type identifiers (
fo/Recipe,fo/has_ingredient) - ✅ Instructions to use exact identifiers from schema
- ✅ New JSON format with entities/relationships/attributes arrays
-
Entity Normalization (
entity_normalizer.py)- ✅
normalize_entity_name()- Converts names to URI-safe format - ✅
normalize_type_identifier()- Handles slashes in types (fo/Recipe→fo-recipe) - ✅
build_entity_uri()- Creates unique URIs using (name, type) tuple - ✅
EntityRegistry- Tracks entities for deduplication
- ✅
-
JSON Parser (
simplified_parser.py)- ✅ Parses new format:
{entities: [...], relationships: [...], attributes: [...]} - ✅ Supports kebab-case and snake_case field names
- ✅ Returns structured dataclasses
- ✅ Graceful error handling with logging
- ✅ Parses new format:
-
Triple Converter (
triple_converter.py)- ✅
convert_entity()- Generates type + label triples automatically - ✅
convert_relationship()- Connects entity URIs via properties - ✅
convert_attribute()- Adds literal values - ✅ Looks up full URIs from ontology definitions
- ✅
-
Updated Main Processor (
extract.py)- ✅ Removed old triple-based extraction code
- ✅ Added
extract_with_simplified_format()method - ✅ Now exclusively uses new simplified format
- ✅ Calls prompt with
extract-with-ontologies-v2ID
Test Cases
Test 1: URI Preservation
# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}
# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}
# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"
Test 2: Multi-Ontology Collision
# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}
# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"} # Not just "Recipe"
Test 3: Entity Instance Format
# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"} # Namespace-style
{"subject": "food:beef"} # Consistent prefix
Open Questions
-
Should entity instances use namespace prefixes?
- Current:
"recipe:cornish-pasty"(arbitrary) - Alternative: Use ontology prefix
"fo:cornish-pasty"? - Alternative: No prefix, expand in URI
"cornish-pasty"→ full URI?
- Current:
-
How to handle domain/range in prompt?
- Currently shows:
(Recipe → Food) - Should it be:
(fo/Recipe → fo/Food)?
- Currently shows:
-
Should we validate domain/range constraints?
- TODO comment at extract.py:470
- Would catch more errors but more complex
-
What about inverse properties and equivalences?
- Ontology has
owl:inverseOf,owl:equivalentClass - Not currently used in extraction
- Should they be?
- Ontology has
Success Metrics
- ✅ Zero URI information loss (100% preservation of original URIs)
- ✅ LLM output format matches input format
- ✅ No ambiguous examples in prompt
- ✅ Tests pass with multiple ontologies
- ✅ Improved extraction quality (measured by valid triple %)
Alternative Approach: Simplified Extraction Format
Philosophy
Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: find entities and relationships in text.
Let the code handle URI construction, RDF conversion, and semantic web formalities.
Example: Entity Classification
Input Text:
Cornish pasty is a traditional British pastry filled with meat and vegetables.
Ontology Schema (shown to LLM):
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
What LLM Returns (Simple JSON):
{
"entities": [
{
"entity": "Cornish pasty",
"type": "Recipe"
}
]
}
What Code Produces (RDF Triples):
# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty" # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty
# 2. Generate triples
triples = [
# Type triple
Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
),
# Label triple (automatic)
Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
o=Value(value="Cornish pasty", is_uri=False)
)
]
Benefits
-
LLM doesn't need to:
- Understand URI syntax
- Invent identifier prefixes (
recipe:,ingredient:) - Know about
rdf:typeorrdfs:label - Construct semantic web identifiers
-
LLM just needs to:
- Find entities in text
- Map them to ontology classes
- Extract relationships and attributes
-
Code handles:
- URI normalization and construction
- RDF triple generation
- Automatic label assignment
- Namespace management
Why This Works Better
- Simpler prompt = less confusion = fewer errors
- Consistent IDs = code controls normalization rules
- Auto-generated labels = no missing rdfs:label triples
- LLM focuses on extraction = what it's actually good at
Example: Entity Relationships
Input Text:
Cornish pasty is a traditional British pastry filled with beef and potatoes.
Ontology Schema (shown to LLM):
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)
What LLM Returns (Simple JSON):
{
"entities": [
{
"entity": "Cornish pasty",
"type": "Recipe"
},
{
"entity": "beef",
"type": "Food"
},
{
"entity": "potatoes",
"type": "Food"
}
],
"relationships": [
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "beef",
"object-type": "Food"
},
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "potatoes",
"object-type": "Food"
}
]
}
What Code Produces (RDF Triples):
# Normalize entity names to URIs
cornish_pasty_uri = "https://trustgraph.ai/food/cornish-pasty"
beef_uri = "https://trustgraph.ai/food/beef"
potatoes_uri = "https://trustgraph.ai/food/potatoes"
# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients" # from fo/has_ingredient
triples = [
# Entity type triples (as before)
Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),
Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
Triple(s=beef_uri, p=rdfs_label, o="beef"),
Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),
# Relationship triples
Triple(
s=Value(value=cornish_pasty_uri, is_uri=True),
p=Value(value=has_ingredient_uri, is_uri=True),
o=Value(value=beef_uri, is_uri=True)
),
Triple(
s=Value(value=cornish_pasty_uri, is_uri=True),
p=Value(value=has_ingredient_uri, is_uri=True),
o=Value(value=potatoes_uri, is_uri=True)
)
]
Key Points:
- LLM returns natural language entity names:
"Cornish pasty","beef","potatoes" - LLM includes types to disambiguate:
subject-type,object-type - LLM uses relation name from schema:
"has_ingredient" - Code derives consistent IDs using (name, type):
("Cornish pasty", "Recipe")→recipe-cornish-pasty - Code looks up relation URI from ontology:
fo/has_ingredient→ full URI - Same (name, type) tuple always gets same URI (deduplication)
Example: Entity Name Disambiguation
Problem: Same name can refer to different entity types.
Real-world case:
"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)
How It's Handled:
LLM returns both as separate entities:
{
"entities": [
{"entity": "Cornish pasty", "type": "Recipe"},
{"entity": "Cornish pasty", "type": "Food"}
],
"relationships": [
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "produces",
"object": "Cornish pasty",
"object-type": "Food"
}
]
}
Code Resolution:
# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"
food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"
# Relationship connects them correctly
triple = Triple(
s=recipe_uri, # The Recipe
p="http://purl.org/ontology/fo/produces",
o=food_uri # The Food
)
Why This Works:
- Type is included in ALL references (entities, relationships, attributes)
- Code uses
(name, type)tuple as lookup key - No ambiguity, no collisions
Example: Entity Attributes
Input Text:
This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.
Ontology Schema (shown to LLM):
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)
What LLM Returns (Simple JSON):
{
"entities": [
{
"entity": "Cornish pasty recipe",
"type": "Recipe"
}
],
"attributes": [
{
"entity": "Cornish pasty recipe",
"entity-type": "Recipe",
"attribute": "serves",
"value": "4-6 people"
},
{
"entity": "Cornish pasty recipe",
"entity-type": "Recipe",
"attribute": "preparation_time",
"value": "45 minutes"
}
]
}
What Code Produces (RDF Triples):
# Normalize entity name to URI
recipe_uri = "https://trustgraph.ai/food/cornish-pasty-recipe"
# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves" # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time" # from fo/preparation_time
triples = [
# Entity type triple
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=rdf_type, is_uri=True),
o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
),
# Label triple (automatic)
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=rdfs_label, is_uri=True),
o=Value(value="Cornish pasty recipe", is_uri=False)
),
# Attribute triples (objects are literals, not URIs)
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=serves_uri, is_uri=True),
o=Value(value="4-6 people", is_uri=False) # Literal value!
),
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=prep_time_uri, is_uri=True),
o=Value(value="45 minutes", is_uri=False) # Literal value!
)
]
Key Points:
- LLM extracts literal values:
"4-6 people","45 minutes" - LLM includes entity type for disambiguation:
entity-type - LLM uses attribute name from schema:
"serves","preparation_time" - Code looks up attribute URI from ontology datatype properties
- Object is literal (
is_uri=False), not a URI reference - Values stay as natural text, no normalization needed
Difference from Relationships:
- Relationships: both subject and object are entities (URIs)
- Attributes: subject is entity (URI), object is literal value (string/number)
Complete Example: Entities + Relationships + Attributes
Input Text:
Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.
What LLM Returns:
{
"entities": [
{
"entity": "Cornish pasty",
"type": "Recipe"
},
{
"entity": "beef",
"type": "Food"
},
{
"entity": "potatoes",
"type": "Food"
}
],
"relationships": [
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "beef",
"object-type": "Food"
},
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "potatoes",
"object-type": "Food"
}
],
"attributes": [
{
"entity": "Cornish pasty",
"entity-type": "Recipe",
"attribute": "serves",
"value": "4 people"
}
]
}
Result: 11 RDF triples generated:
- 3 entity type triples (rdf:type)
- 3 entity label triples (rdfs:label) - automatic
- 2 relationship triples (has_ingredient)
- 1 attribute triple (serves)
All from simple, natural language extractions by the LLM!
References
- Current implementation:
trustgraph-flow/trustgraph/extract/kg/ontology/extract.py - Prompt template:
ontology-prompt.md - Test cases:
tests/unit/test_extract/test_ontology/ - Example ontology:
e2e/test-data/food.ontology