trustgraph/docs/tech-specs/ontology-extract-phase-2.md
cybermaggedon b957004db9
Feature/improve ontology extract (#576)
* Tech spec to change ontology extraction

* Ontology extract refactoring
2025-12-03 13:36:10 +00:00

Ontology Knowledge Extraction - Phase 2 Refactor

Status: Draft
Author: Analysis Session 2025-12-03
Related: ontology.md, ontorag.md

Overview

This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.

Current Implementation

How It Works Now

  1. Ontology Loading (ontology_loader.py)

    • Loads ontology JSON with keys like "fo/Recipe", "fo/Food", "fo/produces"
    • Class IDs include namespace prefix in the key itself
    • Example from food.ontology:
      "classes": {
        "fo/Recipe": {
          "uri": "http://purl.org/ontology/fo/Recipe",
          "rdfs:comment": "A Recipe is a combination..."
        }
      }
      
  2. Prompt Construction (extract.py:299-307, ontology-prompt.md)

    • Template receives classes, object_properties, datatype_properties dicts
    • Template iterates: {% for class_id, class_def in classes.items() %}
    • LLM sees: **fo/Recipe**: A Recipe is a combination...
    • Example output format shows:
      {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
      {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
      
  3. Response Parsing (extract.py:382-428)

    • Expects JSON array: [{"subject": "...", "predicate": "...", "object": "..."}]
    • Validates against ontology subset
    • Expands URIs via expand_uri() (extract.py:473-521)
  4. URI Expansion (extract.py:473-521)

    • Checks if value is in ontology_subset.classes dict
    • If found, extracts URI from class definition
    • If not found, constructs URI: f"https://trustgraph.ai/ontology/{ontology_id}#{value}"

Data Flow Example

Ontology JSON → Loader → Prompt:

"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"

LLM → Parser → Output:

"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI

Problems Identified

1. Inconsistent Examples in Prompt

Issue: The prompt template shows class IDs with prefixes (fo/Recipe) but the example output uses unprefixed class names (Recipe).

Location: ontology-prompt.md:5-52

## Ontology Classes:
- **fo/Recipe**: A Recipe is...

## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}

Impact: LLM receives conflicting signals about what format to use.

2. Information Loss in URI Expansion

Issue: When LLM returns unprefixed class names following the example, expand_uri() can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs.

Location: extract.py:494-500

if value in ontology_subset.classes:  # Looks for "Recipe"
    class_def = ontology_subset.classes[value]  # But key is "fo/Recipe"
    if isinstance(class_def, dict) and 'uri' in class_def:
        return class_def['uri']  # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"  # Fallback

Impact:

  • Original URI: http://purl.org/ontology/fo/Recipe
  • Constructed URI: https://trustgraph.ai/ontology/food#Recipe
  • Semantic meaning lost, breaks interoperability
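The loss can be reproduced with a minimal sketch. The function below is a simplified stand-in for the class branch of expand_uri(), not the code in extract.py; the name expand_class_uri and the `food` ontology ID are assumptions taken from the examples in this document.

```python
# Simplified stand-in for the class lookup in expand_uri() (hypothetical name).
classes = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
}

def expand_class_uri(value: str, classes: dict, ontology_id: str) -> str:
    """Return the ontology URI for a class ID, or a constructed fallback."""
    class_def = classes.get(value)
    if isinstance(class_def, dict) and "uri" in class_def:
        return class_def["uri"]
    # Fallback path: the original URI is never consulted.
    return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"

preserved = expand_class_uri("fo/Recipe", classes, "food")  # key matches
lost = expand_class_uri("Recipe", classes, "food")          # key does not match
print(preserved)
print(lost)
```

Only the prefixed ID reaches the stored URI; the unprefixed name silently takes the fallback path.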

3. Ambiguous Entity Instance Format

Issue: No clear guidance on entity instance URI format.

Examples in prompt:

  • "recipe:cornish-pasty" (namespace-like prefix)
  • "ingredient:flour" (different prefix)

Actual behavior (extract.py:517-520):

# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"

Impact: LLM must guess prefixing convention with no ontology context.
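A small sketch of the consequence, mirroring the fallback above. The function name expand_instance_uri is hypothetical; the body follows the two lines quoted from extract.py.

```python
def expand_instance_uri(value: str, ontology_id: str) -> str:
    """Mirror the fallback shown above: replace spaces, then lowercase."""
    normalized = value.replace(" ", "-").lower()
    return f"https://trustgraph.ai/{ontology_id}/{normalized}"

# The same real-world entity, spelled two ways by the LLM.
with_prefix = expand_instance_uri("recipe:cornish-pasty", "food")
without_prefix = expand_instance_uri("Cornish Pasty", "food")
print(with_prefix)
print(without_prefix)
```

Because the invented prefix is carried into the URI unchanged, two spellings of one entity end up as two distinct nodes.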

4. No Namespace Prefix Guidance

Issue: The ontology JSON contains namespace definitions (lines 10-25 in food.ontology):

"namespaces": {
  "fo": "http://purl.org/ontology/fo/",
  "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  ...
}

But these are never surfaced to the LLM. The LLM doesn't know:

  • What "fo" means
  • What prefix to use for entities
  • Which namespace applies to which elements

5. Labels Not Used in Prompt

Issue: Every class has rdfs:label fields (e.g., {"value": "Recipe", "lang": "en-gb"}), but the prompt template doesn't use them.

Current: Shows only class_id and comment

- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}

Available but unused:

"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]

Impact: Could provide human-readable names alongside technical IDs.

Proposed Solutions

Option A: Normalize to Unprefixed IDs

Approach: Strip prefixes from class IDs before showing to LLM.

Changes:

  1. Modify build_extraction_variables() to transform keys:

    classes_for_prompt = {
        k.split('/')[-1]: v  # "fo/Recipe" → "Recipe"
        for k, v in ontology_subset.classes.items()
    }
    
  2. Update prompt example to match (already uses unprefixed names)

  3. Modify expand_uri() to handle both formats:

    # Try exact match first
    if value in ontology_subset.classes:
        return ontology_subset.classes[value]['uri']
    
    # Try with prefix
    for prefix in ['fo/', 'rdf:', 'rdfs:']:
        prefixed = f"{prefix}{value}"
        if prefixed in ontology_subset.classes:
            return ontology_subset.classes[prefixed]['uri']
    

Pros:

  • Cleaner, more human-readable
  • Matches existing prompt examples
  • LLMs work better with simpler tokens

Cons:

  • Class name collisions if multiple ontologies have same class name
  • Loses namespace information
  • Requires fallback logic for lookups
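The collision risk can be shown with a short sketch. The `cooking/Recipe` class and its example.org URI are hypothetical, added only to trigger the collision; find_collisions is an illustrative helper, not part of the codebase.

```python
classes = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
    # Hypothetical second ontology, used only to show the collision.
    "cooking/Recipe": {"uri": "http://example.org/cooking/Recipe"},
}

# The transformation proposed in Option A: both keys collapse to "Recipe",
# and the later entry silently overwrites the earlier one.
stripped = {k.split("/")[-1]: v for k, v in classes.items()}

def find_collisions(classes: dict) -> dict:
    """Group original class IDs by unprefixed name; keep groups larger than one."""
    groups = {}
    for class_id in classes:
        groups.setdefault(class_id.split("/")[-1], []).append(class_id)
    return {name: ids for name, ids in groups.items() if len(ids) > 1}

collisions = find_collisions(classes)
print(collisions)
```

A check like this could run at load time to reject or warn about ontology sets that Option A cannot represent.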

Option B: Use Full Prefixed IDs Consistently

Approach: Update examples to use prefixed IDs matching what's shown in the class list.

Changes:

  1. Update prompt example (ontology-prompt.md:46-52):

    [
      {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
      {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
      {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
      {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
    ]
    
  2. Add namespace explanation to prompt:

    ## Namespace Prefixes:
    - **fo/**: Food Ontology (http://purl.org/ontology/fo/)
    - **rdf:**: RDF (http://www.w3.org/1999/02/22-rdf-syntax-ns#)
    - **rdfs:**: RDF Schema (http://www.w3.org/2000/01/rdf-schema#)
    
    Use these prefixes exactly as shown when referencing classes and properties.
    
  3. Keep expand_uri() as-is (works correctly when matches found)

Pros:

  • Input = Output consistency
  • No information loss
  • Preserves namespace semantics
  • Works with multiple ontologies

Cons:

  • More verbose tokens for LLM
  • Requires LLM to track prefixes
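The namespace section in step 2 could be generated from the ontology's own "namespaces" block rather than written by hand. A minimal sketch, assuming the dict shape shown earlier in this document; render_namespace_section is a hypothetical helper, and the `rdfs` entry is the standard RDF Schema namespace added here for illustration.

```python
namespaces = {
    "fo": "http://purl.org/ontology/fo/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",  # assumed entry
}

def render_namespace_section(namespaces: dict) -> str:
    """Render a prompt section listing each prefix with its namespace URI."""
    lines = ["## Namespace Prefixes:"]
    for prefix, uri in sorted(namespaces.items()):
        lines.append(f"- **{prefix}**: {uri}")
    return "\n".join(lines)

section = render_namespace_section(namespaces)
print(section)
```

Generating the section keeps the prompt in step with whatever ontologies are loaded, which matters once more than one ontology is in play.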

Option C: Hybrid - Show Both Label and ID

Approach: Enhance prompt to show both human-readable labels and technical IDs.

Changes:

  1. Update prompt template:

    {% for class_id, class_def in classes.items() %}
    - **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
    {% endfor %}
    

    Example output:

    - **fo/Recipe** (label: "Recipe"): A Recipe is a combination...
    
  2. Update instructions:

    When referencing classes:
    - Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
    - The label (e.g., "Recipe") is for human understanding only
    

Pros:

  • Clearest for LLM
  • Preserves all information
  • Explicit about what to use

Cons:

  • Longer prompt
  • More complex template
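The label fallback in the Option C template can be sketched in plain Python. This assumes class definitions keep the raw "rdfs:label" and "rdfs:comment" keys shown in the ontology JSON; the loader may expose them under different names, and both helper functions are hypothetical.

```python
def first_label(class_id: str, class_def: dict) -> str:
    """Return the first rdfs:label value, falling back to the class ID."""
    labels = class_def.get("rdfs:label") or []
    return labels[0]["value"] if labels else class_id

def render_class_line(class_id: str, class_def: dict) -> str:
    """Render one prompt line showing the ID, its label, and any comment."""
    line = f'- **{class_id}** (label: "{first_label(class_id, class_def)}")'
    comment = class_def.get("rdfs:comment")
    if comment:
        line += f": {comment}"
    return line

recipe = {
    "rdfs:label": [{"value": "Recipe", "lang": "en-gb"}],
    "rdfs:comment": "A Recipe is a combination...",
}
with_label = render_class_line("fo/Recipe", recipe)
without_label = render_class_line("fo/Food", {})
print(with_label)
print(without_label)
```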

Implemented Approach

Simplified Entity-Relationship-Attribute Format - completely replaces the old triple-based format.

The new approach was chosen because:

  1. No Information Loss: Original URIs preserved correctly
  2. Simpler Logic: No transformation needed, direct dict lookups work
  3. Namespace Safety: Handles multiple ontologies without collisions
  4. Semantic Correctness: Maintains RDF/OWL semantics

Implementation Complete

What Was Built:

  1. New Prompt Template (prompts/ontology-extract-v2.txt)

    • Clear sections: Entity Types, Relationships, Attributes
    • Example using full type identifiers (fo/Recipe, fo/has_ingredient)
    • Instructions to use exact identifiers from schema
    • New JSON format with entities/relationships/attributes arrays
  2. Entity Normalization (entity_normalizer.py)

    • normalize_entity_name() - Converts names to URI-safe format
    • normalize_type_identifier() - Handles slashes in types (fo/Recipe → fo-recipe)
    • build_entity_uri() - Creates unique URIs using (name, type) tuple
    • EntityRegistry - Tracks entities for deduplication
  3. JSON Parser (simplified_parser.py)

    • Parses new format: {entities: [...], relationships: [...], attributes: [...]}
    • Supports kebab-case and snake_case field names
    • Returns structured dataclasses
    • Graceful error handling with logging
  4. Triple Converter (triple_converter.py)

    • convert_entity() - Generates type + label triples automatically
    • convert_relationship() - Connects entity URIs via properties
    • convert_attribute() - Adds literal values
    • Looks up full URIs from ontology definitions
  5. Updated Main Processor (extract.py)

    • Removed old triple-based extraction code
    • Added extract_with_simplified_format() method
    • Now exclusively uses new simplified format
    • Calls prompt with extract-with-ontologies-v2 ID
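To illustrate the parser's tolerance for kebab-case and snake_case field names, here is a minimal sketch limited to the relationships array. It is not the code in simplified_parser.py; the Relationship dataclass, get_field, and parse_relationships are hypothetical names, and the real module also logs errors.

```python
import json
from dataclasses import dataclass

FIELDS = ("subject", "subject_type", "relation", "object", "object_type")

@dataclass
class Relationship:
    subject: str
    subject_type: str
    relation: str
    object: str
    object_type: str

def get_field(item: dict, name: str):
    """Read a field by snake_case name, accepting the kebab-case spelling too."""
    if name in item:
        return item[name]
    return item.get(name.replace("_", "-"))

def parse_relationships(text: str) -> list:
    """Parse the 'relationships' array, skipping malformed entries."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    items = data.get("relationships") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []
    parsed = []
    for item in items:
        if not isinstance(item, dict):
            continue
        values = [get_field(item, name) for name in FIELDS]
        # Keep only entries where every field is a non-empty string.
        if all(isinstance(v, str) and v for v in values):
            parsed.append(Relationship(*values))
    return parsed

kebab = ('{"relationships": [{"subject": "Cornish pasty", "subject-type": "Recipe", '
         '"relation": "has_ingredient", "object": "beef", "object-type": "Food"}]}')
snake = kebab.replace("subject-type", "subject_type").replace("object-type", "object_type")
print(parse_relationships(kebab))
```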

Test Cases

Test 1: URI Preservation

# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}

# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}

# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"

Test 2: Multi-Ontology Collision

# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}

# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"}  # Not just "Recipe"

Test 3: Entity Instance Format

# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"}  # Namespace-style
{"subject": "food:beef"}              # Consistent prefix

Open Questions

  1. Should entity instances use namespace prefixes?

    • Current: "recipe:cornish-pasty" (arbitrary)
    • Alternative: Use ontology prefix "fo:cornish-pasty"?
    • Alternative: No prefix, expand in URI "cornish-pasty" → full URI?
  2. How to handle domain/range in prompt?

    • Currently shows: (Recipe → Food)
    • Should it be: (fo/Recipe → fo/Food)?
  3. Should we validate domain/range constraints?

    • TODO comment at extract.py:470
    • Would catch more errors but more complex
  4. What about inverse properties and equivalences?

    • Ontology has owl:inverseOf, owl:equivalentClass
    • Not currently used in extraction
    • Should they be?
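For question 3, a domain/range check against the simplified format could look like the sketch below. The "domain" and "range" keys and the check_domain_range helper are assumptions for illustration; the check compares type names exactly and ignores subclass relationships.

```python
# Hypothetical shape: unprefixed relation names with domain and range class names.
object_properties = {
    "has_ingredient": {"domain": "Recipe", "range": "Ingredient"},
    "food": {"domain": "Ingredient", "range": "Food"},
}

def check_domain_range(rel: dict, object_properties: dict) -> list:
    """Return a list of problems; an empty list means the relationship fits."""
    prop = object_properties.get(rel["relation"])
    if prop is None:
        return [f"unknown relation: {rel['relation']}"]
    problems = []
    if rel["subject-type"] != prop["domain"]:
        problems.append(
            f"domain mismatch: expected {prop['domain']}, got {rel['subject-type']}")
    if rel["object-type"] != prop["range"]:
        problems.append(
            f"range mismatch: expected {prop['range']}, got {rel['object-type']}")
    return problems

rel = {"subject": "Cornish pasty", "subject-type": "Recipe",
       "relation": "has_ingredient", "object": "beef", "object-type": "Food"}
problems = check_domain_range(rel, object_properties)
print(problems)
```

A strict check of this kind would reject a Recipe → Food has_ingredient link when the schema declares Recipe → Ingredient, so any validation needs a policy for such cases (reject, warn, or insert an intermediate entity).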

Success Metrics

  • Zero URI information loss (100% preservation of original URIs)
  • LLM output format matches input format
  • No ambiguous examples in prompt
  • Tests pass with multiple ontologies
  • Improved extraction quality (measured by valid triple %)

Alternative Approach: Simplified Extraction Format

Philosophy

Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: find entities and relationships in text.

Let the code handle URI construction, RDF conversion, and semantic web formalities.

Example: Entity Classification

Input Text:

Cornish pasty is a traditional British pastry filled with meat and vegetables.

Ontology Schema (shown to LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

What LLM Returns (Simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    }
  ]
}

What Code Produces (RDF Triples):

# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty"  # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"

# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty

# 2. Generate triples
triples = [
    # Type triple
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),
    # Label triple (automatic)
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
        o=Value(value="Cornish pasty", is_uri=False)
    )
]

Benefits

  1. LLM doesn't need to:

    • Understand URI syntax
    • Invent identifier prefixes (recipe:, ingredient:)
    • Know about rdf:type or rdfs:label
    • Construct semantic web identifiers
  2. LLM just needs to:

    • Find entities in text
    • Map them to ontology classes
    • Extract relationships and attributes
  3. Code handles:

    • URI normalization and construction
    • RDF triple generation
    • Automatic label assignment
    • Namespace management

Why This Works Better

  • Simpler prompt = less confusion = fewer errors
  • Consistent IDs = code controls normalization rules
  • Auto-generated labels = no missing rdfs:label triples
  • LLM focuses on extraction = what it's actually good at
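The automatic type and label triples can be sketched as follows. Value and Triple here are minimal stand-ins using the field names from the example above, not TrustGraph's schema classes, and this convert_entity signature is an assumption.

```python
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

@dataclass(frozen=True)
class Value:
    value: str
    is_uri: bool

@dataclass(frozen=True)
class Triple:
    s: Value
    p: Value
    o: Value

def convert_entity(entity_uri: str, name: str, type_uri: str) -> list:
    """Produce the rdf:type triple and the automatic rdfs:label triple."""
    subject = Value(entity_uri, True)
    return [
        Triple(subject, Value(RDF_TYPE, True), Value(type_uri, True)),
        # The label is the name exactly as the LLM returned it, stored as a literal.
        Triple(subject, Value(RDFS_LABEL, True), Value(name, False)),
    ]

triples = convert_entity(
    "https://trustgraph.ai/food/recipe-cornish-pasty",
    "Cornish pasty",
    "http://purl.org/ontology/fo/Recipe",
)
print(len(triples))
```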

Example: Entity Relationships

Input Text:

Cornish pasty is a traditional British pastry filled with beef and potatoes.

Ontology Schema (shown to LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)

What LLM Returns (Simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ]
}

What Code Produces (RDF Triples):

# Normalize (name, type) pairs to URIs
cornish_pasty_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
beef_uri = "https://trustgraph.ai/food/food-beef"
potatoes_uri = "https://trustgraph.ai/food/food-potatoes"

# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients"  # from fo/has_ingredient

triples = [
    # Entity type triples (as before)
    Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
    Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),

    Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=beef_uri, p=rdfs_label, o="beef"),

    Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),

    # Relationship triples
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=beef_uri, is_uri=True)
    ),
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=potatoes_uri, is_uri=True)
    )
]

Key Points:

  • LLM returns natural language entity names: "Cornish pasty", "beef", "potatoes"
  • LLM includes types to disambiguate: subject-type, object-type
  • LLM uses relation name from schema: "has_ingredient"
  • Code derives consistent IDs using (name, type): ("Cornish pasty", "Recipe") → recipe-cornish-pasty
  • Code looks up relation URI from ontology: fo/has_ingredient → full URI
  • Same (name, type) tuple always gets same URI (deduplication)
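The relation lookup and URI derivation described in these points could be sketched like this. The base URI, the slug rule, and the unprefixed key in object_properties are assumptions for illustration; the shipped triple_converter.py may differ.

```python
import re

BASE = "https://trustgraph.ai/food/"  # assumed base, taken from the examples

def slug(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def entity_uri(name: str, type_id: str) -> str:
    """Derive the URI from the (name, type) pair, type first."""
    return f"{BASE}{slug(type_id)}-{slug(name)}"

# Hypothetical lookup table keyed by the relation name shown to the LLM.
object_properties = {
    "has_ingredient": {"uri": "http://purl.org/ontology/fo/ingredients"},
}

def convert_relationship(rel: dict, object_properties: dict):
    """Return an (s, p, o) URI tuple, or None if the relation is not in the schema."""
    prop = object_properties.get(rel["relation"])
    if prop is None:
        return None
    return (
        entity_uri(rel["subject"], rel["subject-type"]),
        prop["uri"],
        entity_uri(rel["object"], rel["object-type"]),
    )

rel = {"subject": "Cornish pasty", "subject-type": "Recipe",
       "relation": "has_ingredient", "object": "beef", "object-type": "Food"}
spo = convert_relationship(rel, object_properties)
print(spo)
```

Returning None for an unknown relation lets the caller drop the relationship instead of inventing a predicate URI.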

Example: Entity Name Disambiguation

Problem: Same name can refer to different entity types.

Real-world case:

"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)

How It's Handled:

LLM returns both as separate entities:

{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "Cornish pasty", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "produces",
      "object": "Cornish pasty",
      "object-type": "Food"
    }
  ]
}

Code Resolution:

# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"

food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"

# Relationship connects them correctly
triple = Triple(
    s=recipe_uri,  # The Recipe
    p="http://purl.org/ontology/fo/produces",
    o=food_uri     # The Food
)

Why This Works:

  • Type is included in ALL references (entities, relationships, attributes)
  • Code uses (name, type) tuple as lookup key
  • No ambiguity, no collisions
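A minimal sketch of the (name, type) lookup key follows. The function and class names match those listed for entity_normalizer.py, but the signatures and normalization rules shown here are assumptions, not the shipped code.

```python
import re

def normalize_entity_name(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")

def normalize_type_identifier(type_id: str) -> str:
    """Apply the same rule to types, so 'fo/Recipe' becomes 'fo-recipe'."""
    return normalize_entity_name(type_id)

def build_entity_uri(name: str, type_id: str, ontology_id: str) -> str:
    """Build the URI from the normalized type followed by the normalized name."""
    return (f"https://trustgraph.ai/{ontology_id}/"
            f"{normalize_type_identifier(type_id)}-{normalize_entity_name(name)}")

class EntityRegistry:
    """Map each normalized (name, type) pair to exactly one URI."""

    def __init__(self, ontology_id: str):
        self.ontology_id = ontology_id
        self._uris = {}

    def get_uri(self, name: str, type_id: str) -> str:
        key = (normalize_entity_name(name), normalize_type_identifier(type_id))
        if key not in self._uris:
            self._uris[key] = build_entity_uri(name, type_id, self.ontology_id)
        return self._uris[key]

    def __len__(self) -> int:
        return len(self._uris)

registry = EntityRegistry("food")
recipe_uri = registry.get_uri("Cornish pasty", "Recipe")
same_recipe_uri = registry.get_uri("cornish  Pasty", "Recipe")  # spelling variant
food_uri = registry.get_uri("Cornish pasty", "Food")
print(recipe_uri, food_uri, len(registry))
```

Spelling variants of the same pair resolve to one URI, while the same name under a different type gets its own.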

Example: Entity Attributes

Input Text:

This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.

Ontology Schema (shown to LLM):

## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method

## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)

What LLM Returns (Simple JSON):

{
  "entities": [
    {
      "entity": "Cornish pasty recipe",
      "type": "Recipe"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4-6 people"
    },
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "preparation_time",
      "value": "45 minutes"
    }
  ]
}

What Code Produces (RDF Triples):

# Normalize (name, type) to URI
recipe_uri = "https://trustgraph.ai/food/recipe-cornish-pasty-recipe"

# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves"  # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time"  # from fo/preparation_time

triples = [
    # Entity type triple
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdf_type, is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),

    # Label triple (automatic)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdfs_label, is_uri=True),
        o=Value(value="Cornish pasty recipe", is_uri=False)
    ),

    # Attribute triples (objects are literals, not URIs)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=serves_uri, is_uri=True),
        o=Value(value="4-6 people", is_uri=False)  # Literal value!
    ),
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=prep_time_uri, is_uri=True),
        o=Value(value="45 minutes", is_uri=False)  # Literal value!
    )
]

Key Points:

  • LLM extracts literal values: "4-6 people", "45 minutes"
  • LLM includes entity type for disambiguation: entity-type
  • LLM uses attribute name from schema: "serves", "preparation_time"
  • Code looks up attribute URI from ontology datatype properties
  • Object is literal (is_uri=False), not a URI reference
  • Values stay as natural text, no normalization needed

Difference from Relationships:

  • Relationships: both subject and object are entities (URIs)
  • Attributes: subject is entity (URI), object is literal value (string/number)
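The literal-versus-URI distinction can be made concrete with a short sketch. Plain dicts stand in for the Value and Triple types, and both the datatype_properties shape and this convert_attribute signature are assumptions for illustration.

```python
# Hypothetical lookup table keyed by the attribute name shown to the LLM.
datatype_properties = {
    "serves": {"uri": "http://purl.org/ontology/fo/serves"},
    "preparation_time": {"uri": "http://purl.org/ontology/fo/preparation_time"},
}

def convert_attribute(entity_uri: str, attr: dict, datatype_properties: dict):
    """Return one triple whose object is a literal, or None for unknown attributes."""
    prop = datatype_properties.get(attr["attribute"])
    if prop is None:
        return None
    return {
        "s": {"value": entity_uri, "is_uri": True},
        "p": {"value": prop["uri"], "is_uri": True},
        # The extracted text is kept verbatim and never expanded to a URI.
        "o": {"value": str(attr["value"]), "is_uri": False},
    }

attr = {"entity": "Cornish pasty recipe", "entity-type": "Recipe",
        "attribute": "serves", "value": "4-6 people"}
triple = convert_attribute(
    "https://trustgraph.ai/food/recipe-cornish-pasty-recipe", attr, datatype_properties)
print(triple["o"])
```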

Complete Example: Entities + Relationships + Attributes

Input Text:

Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.

What LLM Returns:

{
  "entities": [
    {
      "entity": "Cornish pasty",
      "type": "Recipe"
    },
    {
      "entity": "beef",
      "type": "Food"
    },
    {
      "entity": "potatoes",
      "type": "Food"
    }
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4 people"
    }
  ]
}

Result: 9 RDF triples generated:

  • 3 entity type triples (rdf:type)
  • 3 entity label triples (rdfs:label) - automatic
  • 2 relationship triples (has_ingredient)
  • 1 attribute triple (serves)

All from simple, natural language extractions by the LLM!
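The total follows from the conversion rules: each entity yields two triples (type and label), and each relationship and attribute yields one. A small sketch of that arithmetic, assuming no duplicate entities and that every item resolves against the ontology; expected_triple_count is a hypothetical helper.

```python
def expected_triple_count(extraction: dict) -> int:
    """Count triples: two per entity, one per relationship, one per attribute."""
    entities = extraction.get("entities", [])
    relationships = extraction.get("relationships", [])
    attributes = extraction.get("attributes", [])
    return 2 * len(entities) + len(relationships) + len(attributes)

# Same shape as the complete example: 3 entities, 2 relationships, 1 attribute.
extraction = {
    "entities": [
        {"entity": "Cornish pasty", "type": "Recipe"},
        {"entity": "beef", "type": "Food"},
        {"entity": "potatoes", "type": "Food"},
    ],
    "relationships": [
        {"subject": "Cornish pasty", "relation": "has_ingredient", "object": "beef"},
        {"subject": "Cornish pasty", "relation": "has_ingredient", "object": "potatoes"},
    ],
    "attributes": [
        {"entity": "Cornish pasty", "attribute": "serves", "value": "4 people"},
    ],
}
count = expected_triple_count(extraction)
print(count)
```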

References

  • Current implementation: trustgraph-flow/trustgraph/extract/kg/ontology/extract.py
  • Prompt template: ontology-prompt.md
  • Test cases: tests/unit/test_extract/test_ontology/
  • Example ontology: e2e/test-data/food.ontology