mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 08:26:21 +02:00

Feature/improve ontology extract (#576 )

* Tech spec to change ontology extraction

* Ontology extract refactoring

2025-12-03 13:36:10 +00:00

22 KiB

Raw Blame History

Ontology Knowledge Extraction - Phase 2 Refactor

Status: Draft Author: Analysis Session 2025-12-03 Related: ontology.md, ontorag.md

Overview

This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.

Current Implementation

How It Works Now

Ontology Loading (ontology_loader.py)
- Loads ontology JSON with keys like "fo/Recipe", "fo/Food", "fo/produces"
- Class IDs include namespace prefix in the key itself
- Example from food.ontology:
```
"classes": {
  "fo/Recipe": {
    "uri": "http://purl.org/ontology/fo/Recipe",
    "rdfs:comment": "A Recipe is a combination..."
  }
}
```
Prompt Construction (extract.py:299-307, ontology-prompt.md)
- Template receives classes, object_properties, datatype_properties dicts
- Template iterates: {% for class_id, class_def in classes.items() %}
- LLM sees: **fo/Recipe**: A Recipe is a combination...
- Example output format shows:
```
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
{"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
```
Response Parsing (extract.py:382-428)
- Expects JSON array: [{"subject": "...", "predicate": "...", "object": "..."}]
- Validates against ontology subset
- Expands URIs via expand_uri() (extract.py:473-521)
URI Expansion (extract.py:473-521)
- Checks if value is in ontology_subset.classes dict
- If found, extracts URI from class definition
- If not found, constructs URI: f"https://trustgraph.ai/ontology/{ontology_id}#{value}"

Data Flow Example

Ontology JSON → Loader → Prompt:

"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"

LLM → Parser → Output:

"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI

Problems Identified

1. Inconsistent Examples in Prompt

Issue: The prompt template shows class IDs with prefixes (fo/Recipe) but the example output uses unprefixed class names (Recipe).

Location: ontology-prompt.md:5-52

## Ontology Classes:
- **fo/Recipe**: A Recipe is...

## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}

Impact: LLM receives conflicting signals about what format to use.

2. Information Loss in URI Expansion

Issue: When LLM returns unprefixed class names following the example, expand_uri() can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs.

Location: extract.py:494-500

if value in ontology_subset.classes:  # Looks for "Recipe"
    class_def = ontology_subset.classes[value]  # But key is "fo/Recipe"
    if isinstance(class_def, dict) and 'uri' in class_def:
        return class_def['uri']  # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"  # Fallback

Impact:

Original URI: http://purl.org/ontology/fo/Recipe
Constructed URI: https://trustgraph.ai/ontology/food#Recipe
Semantic meaning lost, breaks interoperability

3. Ambiguous Entity Instance Format

Issue: No clear guidance on entity instance URI format.

Examples in prompt:

"recipe:cornish-pasty" (namespace-like prefix)
"ingredient:flour" (different prefix)

Actual behavior (extract.py:517-520):

# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"

Impact: LLM must guess prefixing convention with no ontology context.

4. No Namespace Prefix Guidance

Issue: The ontology JSON contains namespace definitions (line 10-25 in food.ontology):

"namespaces": {
  "fo": "http://purl.org/ontology/fo/",
  "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  ...
}

But these are never surfaced to the LLM. The LLM doesn't know:

What "fo" means
What prefix to use for entities
Which namespace applies to which elements

5. Labels Not Used in Prompt

Issue: Every class has rdfs:label fields (e.g., {"value": "Recipe", "lang": "en-gb"}), but the prompt template doesn't use them.

Current: Shows only class_id and comment

- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}

Available but unused:

"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]

Impact: Could provide human-readable names alongside technical IDs.

Proposed Solutions

Option A: Normalize to Unprefixed IDs

Approach: Strip prefixes from class IDs before showing to LLM.

Changes:

Modify build_extraction_variables() to transform keys:

classes_for_prompt = {
    k.split('/')[-1]: v  # "fo/Recipe" → "Recipe"
    for k, v in ontology_subset.classes.items()
}

Update prompt example to match (already uses unprefixed names)

Modify expand_uri() to handle both formats:

# Try exact match first
if value in ontology_subset.classes:
    return ontology_subset.classes[value]['uri']

# Try with prefix
for prefix in ['fo/', 'rdf:', 'rdfs:']:
    prefixed = f"{prefix}{value}"
    if prefixed in ontology_subset.classes:
        return ontology_subset.classes[prefixed]['uri']

Pros:

Cleaner, more human-readable
Matches existing prompt examples
LLMs work better with simpler tokens

Cons:

Class name collisions if multiple ontologies have same class name
Loses namespace information
Requires fallback logic for lookups

Option B: Use Full Prefixed IDs Consistently

Approach: Update examples to use prefixed IDs matching what's shown in the class list.

Changes:

Update prompt example (ontology-prompt.md:46-52):

[
  {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
  {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
  {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
  {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
]

Add namespace explanation to prompt:

## Namespace Prefixes:
- **fo/**: Food Ontology (http://purl.org/ontology/fo/)
- **rdf:**: RDF Schema
- **rdfs:**: RDF Schema

Use these prefixes exactly as shown when referencing classes and properties.

Keep expand_uri() as-is (works correctly when matches found)

Pros:

Input = Output consistency
No information loss
Preserves namespace semantics
Works with multiple ontologies

Cons:

More verbose tokens for LLM
Requires LLM to track prefixes

Option C: Hybrid - Show Both Label and ID

Approach: Enhance prompt to show both human-readable labels and technical IDs.

Changes:

Update prompt template:

{% for class_id, class_def in classes.items() %}
- **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
{% endfor %}

Example output:

- **fo/Recipe** (label: "Recipe"): A Recipe is a combination...

Update instructions:

When referencing classes:
- Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
- The label (e.g., "Recipe") is for human understanding only

Pros:

Clearest for LLM
Preserves all information
Explicit about what to use

Cons:

Longer prompt
More complex template

Implemented Approach

Simplified Entity-Relationship-Attribute Format - completely replaces the old triple-based format.

The new approach was chosen because:

No Information Loss: Original URIs preserved correctly
Simpler Logic: No transformation needed, direct dict lookups work
Namespace Safety: Handles multiple ontologies without collisions
Semantic Correctness: Maintains RDF/OWL semantics

Implementation Complete

What Was Built:

New Prompt Template (prompts/ontology-extract-v2.txt)
- ✅ Clear sections: Entity Types, Relationships, Attributes
- ✅ Example using full type identifiers (fo/Recipe, fo/has_ingredient)
- ✅ Instructions to use exact identifiers from schema
- ✅ New JSON format with entities/relationships/attributes arrays
Entity Normalization (entity_normalizer.py)
- ✅ normalize_entity_name() - Converts names to URI-safe format
- ✅ normalize_type_identifier() - Handles slashes in types (fo/Recipe → fo-recipe)
- ✅ build_entity_uri() - Creates unique URIs using (name, type) tuple
- ✅ EntityRegistry - Tracks entities for deduplication
JSON Parser (simplified_parser.py)
- ✅ Parses new format: {entities: [...], relationships: [...], attributes: [...]}
- ✅ Supports kebab-case and snake_case field names
- ✅ Returns structured dataclasses
- ✅ Graceful error handling with logging
Triple Converter (triple_converter.py)
- ✅ convert_entity() - Generates type + label triples automatically
- ✅ convert_relationship() - Connects entity URIs via properties
- ✅ convert_attribute() - Adds literal values
- ✅ Looks up full URIs from ontology definitions
Updated Main Processor (extract.py)
- ✅ Removed old triple-based extraction code
- ✅ Added extract_with_simplified_format() method
- ✅ Now exclusively uses new simplified format
- ✅ Calls prompt with extract-with-ontologies-v2 ID

Test Cases

Test 1: URI Preservation

# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}

# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}

# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"

Test 2: Multi-Ontology Collision

# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}

# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"}  # Not just "Recipe"

Test 3: Entity Instance Format

# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"}  # Namespace-style
{"subject": "food:beef"}              # Consistent prefix

Open Questions

Should entity instances use namespace prefixes?
- Current: "recipe:cornish-pasty" (arbitrary)
- Alternative: Use ontology prefix "fo:cornish-pasty"?
- Alternative: No prefix, expand in URI "cornish-pasty" → full URI?
How to handle domain/range in prompt?
- Currently shows: (Recipe → Food)
- Should it be: (fo/Recipe → fo/Food)?
Should we validate domain/range constraints?
- TODO comment at extract.py:470
- Would catch more errors but more complex
What about inverse properties and equivalences?
- Ontology has owl:inverseOf, owl:equivalentClass
- Not currently used in extraction
- Should they be?

Success Metrics

✅ Zero URI information loss (100% preservation of original URIs)
✅ LLM output format matches input format
✅ No ambiguous examples in prompt
✅ Tests pass with multiple ontologies
✅ Improved extraction quality (measured by valid triple %)

Alternative Approach: Simplified Extraction Format

Philosophy

Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: find entities and relationships in text.

Let the code handle URI construction, RDF conversion, and semantic web formalities.

Example: Entity Classification

Input Text:

Cornish pasty is a traditional British pastry filled with meat and vegetables.

Ontology Schema (shown to LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

What LLM Returns (Simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    }
  ]
}

What Code Produces (RDF Triples):

# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty"  # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"

# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty

# 2. Generate triples
triples = [
    # Type triple
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),
    # Label triple (automatic)
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
        o=Value(value="Cornish pasty", is_uri=False)
    )
]

Benefits

LLM doesn't need to:
- Understand URI syntax
- Invent identifier prefixes (recipe:, ingredient:)
- Know about rdf:type or rdfs:label
- Construct semantic web identifiers
LLM just needs to:
- Find entities in text
- Map them to ontology classes
- Extract relationships and attributes
Code handles:
- URI normalization and construction
- RDF triple generation
- Automatic label assignment
- Namespace management

Why This Works Better

Simpler prompt = less confusion = fewer errors
Consistent IDs = code controls normalization rules
Auto-generated labels = no missing rdfs:label triples
LLM focuses on extraction = what it's actually good at

Example: Entity Relationships

Input Text:

Cornish pasty is a traditional British pastry filled with beef and potatoes.

Ontology Schema (shown to LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)

What LLM Returns (Simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ]
}

What Code Produces (RDF Triples):

# Normalize entity names to URIs
cornish_pasty_uri = "https://trustgraph.ai/food/cornish-pasty"
beef_uri = "https://trustgraph.ai/food/beef"
potatoes_uri = "https://trustgraph.ai/food/potatoes"

# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients"  # from fo/has_ingredient

triples = [
    # Entity type triples (as before)
    Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
    Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),

    Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=beef_uri, p=rdfs_label, o="beef"),

    Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),

    # Relationship triples
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=beef_uri, is_uri=True)
    ),
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=potatoes_uri, is_uri=True)
    )
]

Key Points:

LLM returns natural language entity names: "Cornish pasty", "beef", "potatoes"
LLM includes types to disambiguate: subject-type, object-type
LLM uses relation name from schema: "has_ingredient"
Code derives consistent IDs using (name, type): ("Cornish pasty", "Recipe") → recipe-cornish-pasty
Code looks up relation URI from ontology: fo/has_ingredient → full URI
Same (name, type) tuple always gets same URI (deduplication)

Example: Entity Name Disambiguation

Problem: Same name can refer to different entity types.

Real-world case:

"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)

How It's Handled:

LLM returns both as separate entities:

{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "Cornish pasty", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "produces",
      "object": "Cornish pasty",
      "object-type": "Food"
    }
  ]
}

Code Resolution:

# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"

food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"

# Relationship connects them correctly
triple = Triple(
    s=recipe_uri,  # The Recipe
    p="http://purl.org/ontology/fo/produces",
    o=food_uri     # The Food
)

Why This Works:

Type is included in ALL references (entities, relationships, attributes)
Code uses (name, type) tuple as lookup key
No ambiguity, no collisions

Example: Entity Attributes

Input Text:

This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.

Ontology Schema (shown to LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method

## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)

What LLM Returns (Simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty recipe",
      "type": "Recipe"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4-6 people"
    },
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "preparation_time",
      "value": "45 minutes"
    }
  ]
}

What Code Produces (RDF Triples):

# Normalize entity name to URI
recipe_uri = "https://trustgraph.ai/food/cornish-pasty-recipe"

# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves"  # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time"  # from fo/preparation_time

triples = [
    # Entity type triple
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdf_type, is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),

    # Label triple (automatic)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdfs_label, is_uri=True),
        o=Value(value="Cornish pasty recipe", is_uri=False)
    ),

    # Attribute triples (objects are literals, not URIs)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=serves_uri, is_uri=True),
        o=Value(value="4-6 people", is_uri=False)  # Literal value!
    ),
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=prep_time_uri, is_uri=True),
        o=Value(value="45 minutes", is_uri=False)  # Literal value!
    )
]

Key Points:

LLM extracts literal values: "4-6 people", "45 minutes"
LLM includes entity type for disambiguation: entity-type
LLM uses attribute name from schema: "serves", "preparation_time"
Code looks up attribute URI from ontology datatype properties
Object is literal (is_uri=False), not a URI reference
Values stay as natural text, no normalization needed

Difference from Relationships:

Relationships: both subject and object are entities (URIs)
Attributes: subject is entity (URI), object is literal value (string/number)

Complete Example: Entities + Relationships + Attributes

Input Text:

Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.

What LLM Returns:

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4 people"
    }
  ]
}

Result: 11 RDF triples generated:

3 entity type triples (rdf:type)
3 entity label triples (rdfs:label) - automatic
2 relationship triples (has_ingredient)
1 attribute triple (serves)

All from simple, natural language extractions by the LLM!

References

Current implementation: trustgraph-flow/trustgraph/extract/kg/ontology/extract.py
Prompt template: ontology-prompt.md
Test cases: tests/unit/test_extract/test_ontology/
Example ontology: e2e/test-data/food.ontology

22 KiB Raw Blame History

Ontology Knowledge Extraction - Phase 2 Refactor

Overview

Current Implementation

How It Works Now

Data Flow Example

Problems Identified

1. Inconsistent Examples in Prompt

2. Information Loss in URI Expansion

3. Ambiguous Entity Instance Format

4. No Namespace Prefix Guidance

5. Labels Not Used in Prompt

Proposed Solutions

Option A: Normalize to Unprefixed IDs

Option B: Use Full Prefixed IDs Consistently

Option C: Hybrid - Show Both Label and ID

Implemented Approach

Implementation Complete

What Was Built:

Test Cases

Test 1: URI Preservation

Test 2: Multi-Ontology Collision

Test 3: Entity Instance Format

Open Questions

Success Metrics

Alternative Approach: Simplified Extraction Format

Philosophy

Example: Entity Classification

Benefits

Why This Works Better

Example: Entity Relationships

Example: Entity Name Disambiguation

Example: Entity Attributes

Complete Example: Entities + Relationships + Attributes

References

22 KiB

Raw Blame History