mirror of https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00

Feature/improve ontology extract (#576)

* Tech spec to change ontology extraction
* Ontology extract refactoring

parent 517434c075
commit b957004db9

6 changed files with 1496 additions and 19 deletions

docs/tech-specs/ontology-extract-phase-2.md (new file, 761 lines)

# Ontology Knowledge Extraction - Phase 2 Refactor

**Status**: Draft

**Author**: Analysis Session 2025-12-03

**Related**: `ontology.md`, `ontorag.md`

## Overview

This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.

## Current Implementation

### How It Works Now

1. **Ontology Loading** (`ontology_loader.py`)
   - Loads ontology JSON with keys like `"fo/Recipe"`, `"fo/Food"`, `"fo/produces"`
   - Class IDs include the namespace prefix in the key itself
   - Example from `food.ontology`:

   ```json
   "classes": {
     "fo/Recipe": {
       "uri": "http://purl.org/ontology/fo/Recipe",
       "rdfs:comment": "A Recipe is a combination..."
     }
   }
   ```

2. **Prompt Construction** (`extract.py:299-307`, `ontology-prompt.md`)
   - Template receives `classes`, `object_properties`, `datatype_properties` dicts
   - Template iterates: `{% for class_id, class_def in classes.items() %}`
   - LLM sees: `**fo/Recipe**: A Recipe is a combination...`
   - Example output format shows:

   ```json
   {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
   {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
   ```

3. **Response Parsing** (`extract.py:382-428`)
   - Expects a JSON array: `[{"subject": "...", "predicate": "...", "object": "..."}]`
   - Validates against the ontology subset
   - Expands URIs via `expand_uri()` (`extract.py:473-521`)

4. **URI Expansion** (`extract.py:473-521`)
   - Checks whether the value is a key in the `ontology_subset.classes` dict
   - If found, extracts the URI from the class definition
   - If not found, constructs a fallback URI: `f"https://trustgraph.ai/ontology/{ontology_id}#{value}"`

### Data Flow Example

**Ontology JSON → Loader → Prompt:**
```
"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"
```

**LLM → Parser → Output:**
```
"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI
```

## Problems Identified

### 1. **Inconsistent Examples in Prompt**

**Issue**: The prompt template shows class IDs with prefixes (`fo/Recipe`), but the example output uses unprefixed class names (`Recipe`).

**Location**: `ontology-prompt.md:5-52`

```markdown
## Ontology Classes:
- **fo/Recipe**: A Recipe is...

## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
```

**Impact**: The LLM receives conflicting signals about which format to use.

### 2. **Information Loss in URI Expansion**

**Issue**: When the LLM returns unprefixed class names, following the example, `expand_uri()` can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs.

**Location**: `extract.py:494-500`

```python
if value in ontology_subset.classes:              # Looks for "Recipe"
    class_def = ontology_subset.classes[value]    # But the key is "fo/Recipe"
    if isinstance(class_def, dict) and 'uri' in class_def:
        return class_def['uri']                   # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"  # Fallback
```

**Impact**:
- Original URI: `http://purl.org/ontology/fo/Recipe`
- Constructed URI: `https://trustgraph.ai/ontology/food#Recipe`
- Semantic meaning is lost, which breaks interoperability
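The miss can be reproduced in a few lines. This is a simplified stand-in for `expand_uri()` with the lookup table inlined, not the real implementation:

```python
# Simplified stand-in for expand_uri(): an exact-match dict lookup plus the
# fallback URI construction described above.
ontology_id = "food"
ontology_classes = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
}

def expand_uri(value: str) -> str:
    if value in ontology_classes:
        return ontology_classes[value]["uri"]  # original URI preserved
    # Fallback: constructs a new URI, discarding the ontology's original one
    return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"

preserved = expand_uri("fo/Recipe")  # prefixed key matches → original URI
lost = expand_uri("Recipe")          # unprefixed, as the example shows → fallback URI
```

With the prefixed key the original `purl.org` URI comes back; with the unprefixed name the exact-match lookup fails and a `trustgraph.ai` fallback URI is minted instead.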

### 3. **Ambiguous Entity Instance Format**

**Issue**: No clear guidance on the entity instance URI format.

**Examples in prompt**:
- `"recipe:cornish-pasty"` (namespace-like prefix)
- `"ingredient:flour"` (different prefix)

**Actual behavior** (`extract.py:517-520`):
```python
# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"
```

**Impact**: The LLM must guess the prefixing convention with no ontology context.

### 4. **No Namespace Prefix Guidance**

**Issue**: The ontology JSON contains namespace definitions (lines 10-25 in `food.ontology`):
```json
"namespaces": {
  "fo": "http://purl.org/ontology/fo/",
  "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  ...
}
```

But these are never surfaced to the LLM. The LLM doesn't know:
- What "fo" means
- What prefix to use for entities
- Which namespace applies to which elements

### 5. **Labels Not Used in Prompt**

**Issue**: Every class has `rdfs:label` fields (e.g., `{"value": "Recipe", "lang": "en-gb"}`), but the prompt template doesn't use them.

**Current**: shows only `class_id` and `comment`:
```jinja
- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}
```

**Available but unused**:
```python
"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]
```

**Impact**: Labels could provide human-readable names alongside the technical IDs.

## Proposed Solutions

### Option A: Normalize to Unprefixed IDs

**Approach**: Strip prefixes from class IDs before showing them to the LLM.

**Changes**:
1. Modify `build_extraction_variables()` to transform keys:
   ```python
   classes_for_prompt = {
       k.split('/')[-1]: v  # "fo/Recipe" → "Recipe"
       for k, v in ontology_subset.classes.items()
   }
   ```

2. Update the prompt example to match (it already uses unprefixed names)

3. Modify `expand_uri()` to handle both formats:
   ```python
   # Try exact match first
   if value in ontology_subset.classes:
       return ontology_subset.classes[value]['uri']

   # Try with prefix
   for prefix in ['fo/', 'rdf:', 'rdfs:']:
       prefixed = f"{prefix}{value}"
       if prefixed in ontology_subset.classes:
           return ontology_subset.classes[prefixed]['uri']
   ```

**Pros**:
- Cleaner, more human-readable
- Matches the existing prompt examples
- LLMs work better with simpler tokens

**Cons**:
- Class name collisions if multiple ontologies define the same class name
- Loses namespace information
- Requires fallback logic for lookups
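The collision in the first con is easy to demonstrate. In this sketch the `cooking/` ontology and its URI are hypothetical, invented for illustration:

```python
# Two ontologies that both define a "Recipe" class: stripping the prefixes
# collapses the two keys into one, and the later entry silently overwrites
# the earlier one.
classes = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
    "cooking/Recipe": {"uri": "http://example.org/cooking/Recipe"},  # hypothetical
}

classes_for_prompt = {k.split('/')[-1]: v for k, v in classes.items()}

remaining = len(classes_for_prompt)  # only one "Recipe" entry survives
```

Since plain dict comprehension keeps only the last value per key, one class definition is lost before the prompt is even built.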

### Option B: Use Full Prefixed IDs Consistently

**Approach**: Update the examples to use prefixed IDs matching what is shown in the class list.

**Changes**:
1. Update the prompt example (`ontology-prompt.md:46-52`):
   ```json
   [
     {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
     {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
     {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
     {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
   ]
   ```

2. Add a namespace explanation to the prompt:
   ```markdown
   ## Namespace Prefixes:
   - **fo/**: Food Ontology (http://purl.org/ontology/fo/)
   - **rdf:**: RDF syntax namespace
   - **rdfs:**: RDF Schema

   Use these prefixes exactly as shown when referencing classes and properties.
   ```

3. Keep `expand_uri()` as-is (it works correctly when matches are found)

**Pros**:
- Input = output consistency
- No information loss
- Preserves namespace semantics
- Works with multiple ontologies

**Cons**:
- More verbose tokens for the LLM
- Requires the LLM to track prefixes

### Option C: Hybrid - Show Both Label and ID

**Approach**: Enhance the prompt to show both human-readable labels and technical IDs.

**Changes**:
1. Update the prompt template:
   ```jinja
   {% for class_id, class_def in classes.items() %}
   - **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
   {% endfor %}
   ```

   Example output:
   ```markdown
   - **fo/Recipe** (label: "Recipe"): A Recipe is a combination...
   ```

2. Update the instructions:
   ```markdown
   When referencing classes:
   - Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
   - The label (e.g., "Recipe") is for human understanding only
   ```

**Pros**:
- Clearest for the LLM
- Preserves all information
- Explicit about what to use

**Cons**:
- Longer prompt
- More complex template

## Implemented Approach

**Simplified Entity-Relationship-Attribute Format** - completely replaces the old triple-based format.

The new approach was chosen because:

1. **No Information Loss**: Original URIs are preserved correctly
2. **Simpler Logic**: No transformation needed; direct dict lookups work
3. **Namespace Safety**: Handles multiple ontologies without collisions
4. **Semantic Correctness**: Maintains RDF/OWL semantics

## Implementation Complete

### What Was Built:

1. **New Prompt Template** (`prompts/ontology-extract-v2.txt`)
   - ✅ Clear sections: Entity Types, Relationships, Attributes
   - ✅ Example using full type identifiers (`fo/Recipe`, `fo/has_ingredient`)
   - ✅ Instructions to use the exact identifiers from the schema
   - ✅ New JSON format with entities/relationships/attributes arrays

2. **Entity Normalization** (`entity_normalizer.py`)
   - ✅ `normalize_entity_name()` - converts names to a URI-safe format
   - ✅ `normalize_type_identifier()` - handles slashes in types (`fo/Recipe` → `fo-recipe`)
   - ✅ `build_entity_uri()` - creates unique URIs from the (name, type) tuple
   - ✅ `EntityRegistry` - tracks entities for deduplication

3. **JSON Parser** (`simplified_parser.py`)
   - ✅ Parses the new format: `{entities: [...], relationships: [...], attributes: [...]}`
   - ✅ Supports kebab-case and snake_case field names
   - ✅ Returns structured dataclasses
   - ✅ Graceful error handling with logging

4. **Triple Converter** (`triple_converter.py`)
   - ✅ `convert_entity()` - generates type + label triples automatically
   - ✅ `convert_relationship()` - connects entity URIs via properties
   - ✅ `convert_attribute()` - adds literal values
   - ✅ Looks up full URIs from the ontology definitions

5. **Updated Main Processor** (`extract.py`)
   - ✅ Removed the old triple-based extraction code
   - ✅ Added the `extract_with_simplified_format()` method
   - ✅ Now exclusively uses the new simplified format
   - ✅ Calls the prompt with the `extract-with-ontologies-v2` ID
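Taken together, the components above can be sketched end to end. This is a hedged, condensed illustration, not the real module code: `Triple` and `Value` are simplified stand-ins, the type identifiers are unprefixed for brevity, and the URI layout mirrors the examples later in this spec:

```python
import json
import re
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

@dataclass
class Value:
    value: str
    is_uri: bool

@dataclass
class Triple:
    s: Value
    p: Value
    o: Value

def normalize_entity_name(name: str) -> str:
    # Condensed entity_normalizer.py: lowercase, hyphenate whitespace and
    # underscores, drop URI-unsafe characters, collapse and trim hyphens.
    slug = re.sub(r'[\s_]+', '-', name.lower())
    slug = re.sub(r'[^a-z0-9\-.]', '', slug)
    return re.sub(r'-+', '-', slug).strip('-')

def build_entity_uri(name: str, type_id: str, ontology_id: str) -> str:
    # The (name, type) pair keeps same-named entities of different types distinct.
    return (f"https://trustgraph.ai/{ontology_id}/"
            f"{normalize_entity_name(type_id)}-{normalize_entity_name(name)}")

def parse_response(raw: str) -> dict:
    # Condensed simplified_parser.py: accepts kebab-case and snake_case keys.
    data = json.loads(raw)
    fix = lambda d: {k.replace('-', '_'): v for k, v in d.items()}
    return {section: [fix(item) for item in data.get(section, [])]
            for section in ("entities", "relationships", "attributes")}

def convert_entity(entity: dict, class_uris: dict, ontology_id: str) -> list:
    # Condensed triple_converter.py: each entity yields a type triple plus an
    # automatic label triple, with the class URI looked up from the ontology.
    uri = build_entity_uri(entity["entity"], entity["type"], ontology_id)
    return [
        Triple(Value(uri, True), Value(RDF_TYPE, True),
               Value(class_uris[entity["type"]], True)),
        Triple(Value(uri, True), Value(RDFS_LABEL, True),
               Value(entity["entity"], False)),
    ]

extraction = parse_response(
    '{"entities": [{"entity": "Cornish pasty", "type": "Recipe"}]}'
)
triples = convert_entity(
    extraction["entities"][0],
    {"Recipe": "http://purl.org/ontology/fo/Recipe"},
    "food",
)
```

One extracted entity becomes two triples, with the original ontology URI preserved by lookup rather than reconstructed.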

## Test Cases

### Test 1: URI Preservation
```python
# Given an ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}

# When the LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}

# Then the expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"
```

### Test 2: Multi-Ontology Collision
```python
# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}

# The LLM should use the full prefix to disambiguate
llm_output = {"object": "fo/Recipe"}  # Not just "Recipe"
```

### Test 3: Entity Instance Format
```python
# Given a prompt with the food ontology
# the LLM should create instances like
{"subject": "recipe:cornish-pasty"}  # Namespace-style
{"subject": "food:beef"}             # Consistent prefix
```

## Open Questions

1. **Should entity instances use namespace prefixes?**
   - Current: `"recipe:cornish-pasty"` (arbitrary)
   - Alternative: use the ontology prefix, `"fo:cornish-pasty"`?
   - Alternative: no prefix; expand `"cornish-pasty"` to a full URI?

2. **How to handle domain/range in the prompt?**
   - Currently shows: `(Recipe → Food)`
   - Should it be: `(fo/Recipe → fo/Food)`?

3. **Should we validate domain/range constraints?**
   - TODO comment at `extract.py:470`
   - Would catch more errors but adds complexity

4. **What about inverse properties and equivalences?**
   - The ontology has `owl:inverseOf` and `owl:equivalentClass`
   - Not currently used in extraction
   - Should they be?

## Success Metrics

- ✅ Zero URI information loss (100% preservation of original URIs)
- ✅ LLM output format matches the input format
- ✅ No ambiguous examples in the prompt
- ✅ Tests pass with multiple ontologies
- ✅ Improved extraction quality (measured by valid-triple percentage)

## Alternative Approach: Simplified Extraction Format

### Philosophy

Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: **find entities and relationships in text**.

Let the code handle URI construction, RDF conversion, and the semantic web formalities.

### Example: Entity Classification

**Input Text:**
```
Cornish pasty is a traditional British pastry filled with meat and vegetables.
```

**Ontology Schema (shown to LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
```

**What LLM Returns (Simple JSON):**
```json
{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"}
  ]
}
```

**What Code Produces (RDF Triples):**
```python
# 1. Normalize entity name + type to an ID (the type prevents collisions)
entity_id = "recipe-cornish-pasty"  # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"

# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food)   → food-cornish-pasty

# 2. Generate triples
triples = [
    # Type triple
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),
    # Label triple (automatic)
    Triple(
        s=Value(value=entity_uri, is_uri=True),
        p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
        o=Value(value="Cornish pasty", is_uri=False)
    )
]
```

### Benefits

1. **The LLM doesn't need to:**
   - Understand URI syntax
   - Invent identifier prefixes (`recipe:`, `ingredient:`)
   - Know about `rdf:type` or `rdfs:label`
   - Construct semantic web identifiers

2. **The LLM just needs to:**
   - Find entities in the text
   - Map them to ontology classes
   - Extract relationships and attributes

3. **The code handles:**
   - URI normalization and construction
   - RDF triple generation
   - Automatic label assignment
   - Namespace management

### Why This Works Better

- **Simpler prompt** = less confusion = fewer errors
- **Consistent IDs** = code controls the normalization rules
- **Auto-generated labels** = no missing `rdfs:label` triples
- **LLM focuses on extraction** = what it's actually good at

### Example: Entity Relationships

**Input Text:**
```
Cornish pasty is a traditional British pastry filled with beef and potatoes.
```

**Ontology Schema (shown to LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food

## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)
```

**What LLM Returns (Simple JSON):**
```json
{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "beef", "type": "Food"},
    {"entity": "potatoes", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ]
}
```

**What Code Produces (RDF Triples):**
```python
# Normalize entity names to URIs
cornish_pasty_uri = "https://trustgraph.ai/food/cornish-pasty"
beef_uri = "https://trustgraph.ai/food/beef"
potatoes_uri = "https://trustgraph.ai/food/potatoes"

# Look up the relation URI from the ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients"  # from fo/has_ingredient

triples = [
    # Entity type triples (as before)
    Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
    Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),

    Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=beef_uri, p=rdfs_label, o="beef"),

    Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
    Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),

    # Relationship triples
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=beef_uri, is_uri=True)
    ),
    Triple(
        s=Value(value=cornish_pasty_uri, is_uri=True),
        p=Value(value=has_ingredient_uri, is_uri=True),
        o=Value(value=potatoes_uri, is_uri=True)
    )
]
```

**Key Points:**
- The LLM returns natural-language entity names: `"Cornish pasty"`, `"beef"`, `"potatoes"`
- The LLM includes types to disambiguate: `subject-type`, `object-type`
- The LLM uses the relation name from the schema: `"has_ingredient"`
- The code derives consistent IDs using (name, type): `("Cornish pasty", "Recipe")` → `recipe-cornish-pasty`
- The code looks up the relation URI from the ontology: `fo/has_ingredient` → full URI
- The same (name, type) tuple always gets the same URI (deduplication)
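The deduplication guarantee in the last point can be sketched with a tiny registry. This is a hypothetical shape for illustration; the real `EntityRegistry` lives in `entity_normalizer.py` and uses the full normalization rules:

```python
# Hypothetical sketch of the EntityRegistry idea: the (name, type) tuple is
# the lookup key, so repeated mentions resolve to the same URI.
class EntityRegistry:
    def __init__(self, ontology_id: str):
        self.ontology_id = ontology_id
        self._uris = {}

    def get_or_create(self, name: str, type_id: str) -> str:
        key = (name, type_id)
        if key not in self._uris:
            slug = f"{type_id}-{name}".lower().replace(" ", "-")
            self._uris[key] = f"https://trustgraph.ai/{self.ontology_id}/{slug}"
        return self._uris[key]

reg = EntityRegistry("food")
a = reg.get_or_create("Cornish pasty", "Recipe")
b = reg.get_or_create("Cornish pasty", "Recipe")  # same key → same URI
c = reg.get_or_create("Cornish pasty", "Food")    # different type → new URI
```

Repeated mentions of the same (name, type) pair deduplicate to one URI, while the same name under a different type gets its own.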

### Example: Entity Name Disambiguation

**Problem:** The same name can refer to different entity types.

**Real-world case:**
```
"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)
```

**How It's Handled:**

The LLM returns both as separate entities:
```json
{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "Cornish pasty", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "produces",
      "object": "Cornish pasty",
      "object-type": "Food"
    }
  ]
}
```

**Code Resolution:**
```python
# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"

food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"

# The relationship connects them correctly
triple = Triple(
    s=recipe_uri,  # The Recipe
    p="http://purl.org/ontology/fo/produces",
    o=food_uri     # The Food
)
```

**Why This Works:**
- The type is included in ALL references (entities, relationships, attributes)
- The code uses the `(name, type)` tuple as the lookup key
- No ambiguity, no collisions

### Example: Entity Attributes

**Input Text:**
```
This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.
```

**Ontology Schema (shown to LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method

## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)
```

**What LLM Returns (Simple JSON):**
```json
{
  "entities": [
    {"entity": "Cornish pasty recipe", "type": "Recipe"}
  ],
  "attributes": [
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4-6 people"
    },
    {
      "entity": "Cornish pasty recipe",
      "entity-type": "Recipe",
      "attribute": "preparation_time",
      "value": "45 minutes"
    }
  ]
}
```

**What Code Produces (RDF Triples):**
```python
# Normalize the entity name to a URI
recipe_uri = "https://trustgraph.ai/food/cornish-pasty-recipe"

# Look up attribute URIs from the ontology
serves_uri = "http://purl.org/ontology/fo/serves"               # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time"  # from fo/preparation_time

triples = [
    # Entity type triple
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdf_type, is_uri=True),
        o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
    ),

    # Label triple (automatic)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=rdfs_label, is_uri=True),
        o=Value(value="Cornish pasty recipe", is_uri=False)
    ),

    # Attribute triples (objects are literals, not URIs)
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=serves_uri, is_uri=True),
        o=Value(value="4-6 people", is_uri=False)  # Literal value!
    ),
    Triple(
        s=Value(value=recipe_uri, is_uri=True),
        p=Value(value=prep_time_uri, is_uri=True),
        o=Value(value="45 minutes", is_uri=False)  # Literal value!
    )
]
```

**Key Points:**
- The LLM extracts literal values: `"4-6 people"`, `"45 minutes"`
- The LLM includes the entity type for disambiguation: `entity-type`
- The LLM uses the attribute name from the schema: `"serves"`, `"preparation_time"`
- The code looks up the attribute URI from the ontology's datatype properties
- **The object is a literal** (`is_uri=False`), not a URI reference
- Values stay as natural text; no normalization needed

**Difference from Relationships:**
- Relationships: both subject and object are entities (URIs)
- Attributes: the subject is an entity (URI), the object is a literal value (string/number)

### Complete Example: Entities + Relationships + Attributes

**Input Text:**
```
Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.
```

**What LLM Returns:**
```json
{
  "entities": [
    {"entity": "Cornish pasty", "type": "Recipe"},
    {"entity": "beef", "type": "Food"},
    {"entity": "potatoes", "type": "Food"}
  ],
  "relationships": [
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "beef",
      "object-type": "Food"
    },
    {
      "subject": "Cornish pasty",
      "subject-type": "Recipe",
      "relation": "has_ingredient",
      "object": "potatoes",
      "object-type": "Food"
    }
  ],
  "attributes": [
    {
      "entity": "Cornish pasty",
      "entity-type": "Recipe",
      "attribute": "serves",
      "value": "4 people"
    }
  ]
}
```

**Result:** 9 RDF triples generated:
- 3 entity type triples (`rdf:type`)
- 3 entity label triples (`rdfs:label`) - automatic
- 2 relationship triples (`has_ingredient`)
- 1 attribute triple (`serves`)

All from simple, natural-language extractions by the LLM.

## References

- Current implementation: `trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`
- Prompt template: `ontology-prompt.md`
- Test cases: `tests/unit/test_extract/test_ontology/`
- Example ontology: `e2e/test-data/food.ontology`

ontology-prompt.md (new file, 54 lines)

You are a knowledge extraction expert. Extract structured triples from text using ONLY the provided ontology elements.

## Ontology Classes:

{% for class_id, class_def in classes.items() %}
- **{{class_id}}**{% if class_def.subclass_of %} (subclass of {{class_def.subclass_of}}){% endif %}{% if class_def.comment %}: {{class_def.comment}}{% endif %}
{% endfor %}

## Object Properties (connect entities):

{% for prop_id, prop_def in object_properties.items() %}
- **{{prop_id}}**{% if prop_def.domain and prop_def.range %} ({{prop_def.domain}} → {{prop_def.range}}){% endif %}{% if prop_def.comment %}: {{prop_def.comment}}{% endif %}
{% endfor %}

## Datatype Properties (entity attributes):

{% for prop_id, prop_def in datatype_properties.items() %}
- **{{prop_id}}**{% if prop_def.domain and prop_def.range %} ({{prop_def.domain}} → {{prop_def.range}}){% endif %}{% if prop_def.comment %}: {{prop_def.comment}}{% endif %}
{% endfor %}

## Text to Analyze:

{{text}}

## Extraction Rules:

1. Only use classes defined above for entity types
2. Only use properties defined above for relationships and attributes
3. Respect domain and range constraints where specified
4. For class instances, use `rdf:type` as the predicate
5. Include `rdfs:label` for new entities to provide human-readable names
6. Extract all relevant triples that can be inferred from the text
7. Use entity URIs or meaningful identifiers as subjects/objects

## Output Format:

Return ONLY a valid JSON array (no markdown, no code blocks) containing objects with these fields:
- "subject": the subject entity (URI or identifier)
- "predicate": the property (from the ontology, or rdf:type/rdfs:label)
- "object": the object entity or literal value

Important: Return raw JSON only, with no markdown formatting, no code blocks, and no backticks.
|
||||||
|
|
||||||
|
## Example Output:
|
||||||
|
|
||||||
|
[
|
||||||
|
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"},
|
||||||
|
{"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
|
||||||
|
{"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"},
|
||||||
|
{"subject": "ingredient:flour", "predicate": "rdf:type", "object": "Ingredient"},
|
||||||
|
{"subject": "ingredient:flour", "predicate": "rdfs:label", "object": "plain flour"}
|
||||||
|
]
|
||||||
|
|
||||||
|
Now extract triples from the text above.
|
||||||
|
|
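The template's class loop can be mimicked in plain Python to see exactly what the LLM is shown per class. `format_class_lines` is a hypothetical helper for illustration, not part of the codebase:

```python
def format_class_lines(classes: dict) -> list:
    """Mimic the Jinja2 class loop above in plain Python."""
    lines = []
    for class_id, class_def in classes.items():
        line = f"- **{class_id}**"
        if class_def.get("subclass_of"):
            line += f" (subclass of {class_def['subclass_of']})"
        if class_def.get("comment"):
            line += f": {class_def['comment']}"
        lines.append(line)
    return lines

classes = {"fo/Recipe": {"comment": "A Recipe is a combination of ingredients."}}
print(format_class_lines(classes))
# → ['- **fo/Recipe**: A Recipe is a combination of ingredients.']
```

Note the LLM sees the raw `fo/Recipe` key, namespace prefix included, which is why the example output's bare `Recipe` is inconsistent with the class list shown.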
@@ -0,0 +1,164 @@
"""
Entity URI normalization for ontology-based knowledge extraction.

Converts entity names and types into consistent, collision-free URIs.
"""

import re
from typing import Tuple


def normalize_entity_name(entity_name: str) -> str:
    """Normalize entity name to URI-safe identifier.

    Args:
        entity_name: Natural language entity name (e.g., "Cornish pasty")

    Returns:
        Normalized identifier (e.g., "cornish-pasty")
    """
    # Convert to lowercase
    normalized = entity_name.lower()

    # Replace spaces and underscores with hyphens
    normalized = re.sub(r'[\s_]+', '-', normalized)

    # Remove any characters that aren't alphanumeric, hyphens, or periods
    normalized = re.sub(r'[^a-z0-9\-.]', '', normalized)

    # Remove leading/trailing hyphens
    normalized = normalized.strip('-')

    # Collapse multiple hyphens
    normalized = re.sub(r'-+', '-', normalized)

    return normalized


def normalize_type_identifier(type_id: str) -> str:
    """Normalize ontology type identifier to URI-safe format.

    Handles prefixed types like "fo/Recipe" by converting to "fo-recipe".

    Args:
        type_id: Ontology type identifier (e.g., "fo/Recipe", "Food")

    Returns:
        Normalized type identifier (e.g., "fo-recipe", "food")
    """
    # Convert to lowercase
    normalized = type_id.lower()

    # Replace slashes, colons, and spaces with hyphens
    normalized = re.sub(r'[/:.\s_]+', '-', normalized)

    # Remove any remaining non-alphanumeric characters except hyphens
    normalized = re.sub(r'[^a-z0-9\-]', '', normalized)

    # Remove leading/trailing hyphens
    normalized = normalized.strip('-')

    # Collapse multiple hyphens
    normalized = re.sub(r'-+', '-', normalized)

    return normalized


def build_entity_uri(entity_name: str, entity_type: str, ontology_id: str,
                     base_uri: str = "https://trustgraph.ai") -> str:
    """Build a unique URI for an entity based on its name and type.

    The type is included in the URI to prevent collisions when the same
    name refers to different entity types (e.g., "Cornish pasty" as both
    Recipe and Food).

    Args:
        entity_name: Natural language entity name (e.g., "Cornish pasty")
        entity_type: Ontology type (e.g., "fo/Recipe")
        ontology_id: Ontology identifier (e.g., "food")
        base_uri: Base URI for entity URIs (default: "https://trustgraph.ai")

    Returns:
        Full entity URI (e.g., "https://trustgraph.ai/food/fo-recipe-cornish-pasty")

    Examples:
        >>> build_entity_uri("Cornish pasty", "fo/Recipe", "food")
        'https://trustgraph.ai/food/fo-recipe-cornish-pasty'

        >>> build_entity_uri("Cornish pasty", "fo/Food", "food")
        'https://trustgraph.ai/food/fo-food-cornish-pasty'

        >>> build_entity_uri("beef", "fo/Food", "food")
        'https://trustgraph.ai/food/fo-food-beef'
    """
    type_part = normalize_type_identifier(entity_type)
    name_part = normalize_entity_name(entity_name)

    # Combine type and name to ensure uniqueness
    entity_id = f"{type_part}-{name_part}"

    # Build full URI
    return f"{base_uri}/{ontology_id}/{entity_id}"


class EntityRegistry:
    """Registry to track entity name/type tuples and their assigned URIs.

    Ensures that the same (entity_name, entity_type) tuple always maps
    to the same URI, enabling deduplication across the extraction process.
    """

    def __init__(self, ontology_id: str, base_uri: str = "https://trustgraph.ai"):
        """Initialize the entity registry.

        Args:
            ontology_id: Ontology identifier (e.g., "food")
            base_uri: Base URI for entity URIs
        """
        self.ontology_id = ontology_id
        self.base_uri = base_uri
        self._registry = {}  # (entity_name, entity_type) -> uri

    def get_or_create_uri(self, entity_name: str, entity_type: str) -> str:
        """Get existing URI or create new one for entity.

        Args:
            entity_name: Natural language entity name
            entity_type: Ontology type identifier

        Returns:
            URI for this entity (same URI for same name/type tuple)
        """
        key = (entity_name, entity_type)

        if key not in self._registry:
            uri = build_entity_uri(
                entity_name,
                entity_type,
                self.ontology_id,
                self.base_uri
            )
            self._registry[key] = uri

        return self._registry[key]

    def lookup(self, entity_name: str, entity_type: str) -> str:
        """Look up URI for entity (returns None if not registered).

        Args:
            entity_name: Natural language entity name
            entity_type: Ontology type identifier

        Returns:
            URI for this entity, or None if not found
        """
        key = (entity_name, entity_type)
        return self._registry.get(key)

    def clear(self):
        """Clear all registered entities."""
        self._registry.clear()

    def size(self) -> int:
        """Get number of registered entities."""
        return len(self._registry)
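The normalization pipeline boils down to a compact, self-contained sketch. `_norm` and `build_uri` below are simplified, hypothetical stand-ins for `normalize_type_identifier`, `normalize_entity_name`, and `build_entity_uri`, reproducing the doctest results above:

```python
import re

def _norm(s: str) -> str:
    # Simplified combination of the normalizers above, for illustration.
    s = re.sub(r'[/:.\s_]+', '-', s.lower())   # separators -> hyphens
    s = re.sub(r'[^a-z0-9\-]', '', s)          # drop anything else
    return re.sub(r'-+', '-', s).strip('-')    # collapse and trim hyphens

def build_uri(name: str, type_: str, ontology_id: str,
              base: str = "https://trustgraph.ai") -> str:
    # Type prefix keeps "Cornish pasty" as Recipe distinct from it as Food.
    return f"{base}/{ontology_id}/{_norm(type_)}-{_norm(name)}"

print(build_uri("Cornish pasty", "fo/Recipe", "food"))
# → https://trustgraph.ai/food/fo-recipe-cornish-pasty
```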
@@ -20,6 +20,8 @@ from .ontology_embedder import OntologyEmbedder
 from .vector_store import InMemoryVectorStore
 from .text_processor import TextProcessor
 from .ontology_selector import OntologySelector, OntologySubset
+from .simplified_parser import parse_extraction_response
+from .triple_converter import TripleConverter

 logger = logging.getLogger(__name__)

@@ -298,25 +300,10 @@ class Processor(FlowProcessor):
         # Build extraction prompt variables
         prompt_variables = self.build_extraction_variables(chunk, ontology_subset)

-        # Call prompt service for extraction
-        try:
-            # Use prompt() method with extract-with-ontologies prompt ID
-            triples_response = await flow("prompt-request").prompt(
-                id="extract-with-ontologies",
-                variables=prompt_variables
-            )
-            logger.debug(f"Extraction response: {triples_response}")
-
-            if not isinstance(triples_response, list):
-                logger.error("Expected list of triples from prompt service")
-                triples_response = []
-
-        except Exception as e:
-            logger.error(f"Prompt service error: {e}", exc_info=True)
-            triples_response = []
-
-        # Parse and validate triples
-        triples = self.parse_and_validate_triples(triples_response, ontology_subset)
+        # Extract using simplified entity-relationship-attribute format
+        triples = await self.extract_with_simplified_format(
+            flow, chunk, ontology_subset, prompt_variables
+        )

         # Add metadata triples
         for t in v.metadata.metadata:

@@ -362,6 +349,55 @@ class Processor(FlowProcessor):
             []
         )

+    async def extract_with_simplified_format(
+        self,
+        flow,
+        chunk: str,
+        ontology_subset: OntologySubset,
+        prompt_variables: Dict[str, Any]
+    ) -> List[Triple]:
+        """Extract triples using simplified entity-relationship-attribute format.
+
+        Args:
+            flow: Flow object for accessing services
+            chunk: Text chunk to extract from
+            ontology_subset: Selected ontology subset
+            prompt_variables: Variables for prompt template
+
+        Returns:
+            List of Triple objects
+        """
+        try:
+            # Call prompt service with simplified format prompt
+            extraction_response = await flow("prompt-request").prompt(
+                id="extract-with-ontologies",
+                variables=prompt_variables
+            )
+            logger.debug(f"Simplified extraction response: {extraction_response}")
+
+            # Parse response into structured format
+            extraction_result = parse_extraction_response(extraction_response)
+
+            if not extraction_result:
+                logger.warning("Failed to parse extraction response")
+                return []
+
+            logger.info(f"Parsed {len(extraction_result.entities)} entities, "
+                        f"{len(extraction_result.relationships)} relationships, "
+                        f"{len(extraction_result.attributes)} attributes")
+
+            # Convert to RDF triples
+            converter = TripleConverter(ontology_subset, ontology_subset.ontology_id)
+            triples = converter.convert_all(extraction_result)
+
+            logger.info(f"Generated {len(triples)} RDF triples from simplified extraction")
+
+            return triples
+
+        except Exception as e:
+            logger.error(f"Simplified extraction error: {e}", exc_info=True)
+            return []
+
     def build_extraction_variables(self, chunk: str, ontology_subset: OntologySubset) -> Dict[str, Any]:
         """Build variables for ontology-based extraction prompt template.
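The new call shape (an async prompt-service call returning the simplified JSON structure) can be sketched with a stand-in client. `FakePromptClient` is hypothetical and only mirrors the `prompt(id=..., variables=...)` signature used in the diff above:

```python
import asyncio

class FakePromptClient:
    """Stand-in for flow("prompt-request"); returns an empty simplified result."""
    async def prompt(self, id, variables):
        return {"entities": [], "relationships": [], "attributes": []}

async def main():
    client = FakePromptClient()
    response = await client.prompt(
        id="extract-with-ontologies",
        variables={"text": "Cornish pasty contains beef."},
    )
    # The processor then hands this dict to parse_extraction_response().
    print(sorted(response.keys()))

asyncio.run(main())
# → ['attributes', 'entities', 'relationships']
```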
@@ -0,0 +1,234 @@
"""
Parser for simplified ontology extraction JSON format.

Parses the new entity-relationship-attribute format from LLM responses.
"""

import json
import logging
from typing import List, Dict, Any, Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Entity:
    """Represents an extracted entity."""
    entity: str
    type: str


@dataclass
class Relationship:
    """Represents an extracted relationship."""
    subject: str
    subject_type: str
    relation: str
    object: str
    object_type: str


@dataclass
class Attribute:
    """Represents an extracted attribute."""
    entity: str
    entity_type: str
    attribute: str
    value: str


@dataclass
class ExtractionResult:
    """Complete extraction result."""
    entities: List[Entity]
    relationships: List[Relationship]
    attributes: List[Attribute]


def parse_extraction_response(response: Any) -> Optional[ExtractionResult]:
    """Parse LLM extraction response into structured format.

    Args:
        response: LLM response (string JSON or already parsed dict)

    Returns:
        ExtractionResult with parsed entities/relationships/attributes,
        or None if parsing fails
    """
    # Handle string response (parse JSON)
    if isinstance(response, str):
        try:
            data = json.loads(response)
        except json.JSONDecodeError as e:
            logger.error(f"Failed to parse JSON response: {e}")
            logger.debug(f"Response was: {response[:500]}")
            return None
    elif isinstance(response, dict):
        data = response
    else:
        logger.error(f"Unexpected response type: {type(response)}")
        return None

    # Validate structure
    if not isinstance(data, dict):
        logger.error(f"Expected dict, got {type(data)}")
        return None

    # Parse entities
    entities = []
    entities_data = data.get('entities', [])
    if not isinstance(entities_data, list):
        logger.warning(f"'entities' is not a list: {type(entities_data)}")
        entities_data = []

    for entity_data in entities_data:
        try:
            entity = parse_entity(entity_data)
            if entity:
                entities.append(entity)
        except Exception as e:
            logger.warning(f"Failed to parse entity {entity_data}: {e}")

    # Parse relationships
    relationships = []
    relationships_data = data.get('relationships', [])
    if not isinstance(relationships_data, list):
        logger.warning(f"'relationships' is not a list: {type(relationships_data)}")
        relationships_data = []

    for rel_data in relationships_data:
        try:
            relationship = parse_relationship(rel_data)
            if relationship:
                relationships.append(relationship)
        except Exception as e:
            logger.warning(f"Failed to parse relationship {rel_data}: {e}")

    # Parse attributes
    attributes = []
    attributes_data = data.get('attributes', [])
    if not isinstance(attributes_data, list):
        logger.warning(f"'attributes' is not a list: {type(attributes_data)}")
        attributes_data = []

    for attr_data in attributes_data:
        try:
            attribute = parse_attribute(attr_data)
            if attribute:
                attributes.append(attribute)
        except Exception as e:
            logger.warning(f"Failed to parse attribute {attr_data}: {e}")

    return ExtractionResult(
        entities=entities,
        relationships=relationships,
        attributes=attributes
    )


def parse_entity(data: Dict[str, Any]) -> Optional[Entity]:
    """Parse entity from dict.

    Args:
        data: Entity dict with 'entity' and 'type' fields

    Returns:
        Entity object or None if invalid
    """
    if not isinstance(data, dict):
        logger.warning(f"Entity data is not a dict: {type(data)}")
        return None

    entity = data.get('entity')
    entity_type = data.get('type')

    if not entity or not entity_type:
        logger.warning(f"Missing required fields in entity: {data}")
        return None

    if not isinstance(entity, str) or not isinstance(entity_type, str):
        logger.warning(f"Entity fields must be strings: {data}")
        return None

    return Entity(entity=entity, type=entity_type)


def parse_relationship(data: Dict[str, Any]) -> Optional[Relationship]:
    """Parse relationship from dict.

    Supports both kebab-case and snake_case field names for compatibility.

    Args:
        data: Relationship dict with subject, subject-type, relation, object, object-type

    Returns:
        Relationship object or None if invalid
    """
    if not isinstance(data, dict):
        logger.warning(f"Relationship data is not a dict: {type(data)}")
        return None

    subject = data.get('subject')
    subject_type = data.get('subject-type') or data.get('subject_type')
    relation = data.get('relation')
    obj = data.get('object')
    object_type = data.get('object-type') or data.get('object_type')

    if not all([subject, subject_type, relation, obj, object_type]):
        logger.warning(f"Missing required fields in relationship: {data}")
        return None

    if not all(isinstance(v, str) for v in [subject, subject_type, relation, obj, object_type]):
        logger.warning(f"Relationship fields must be strings: {data}")
        return None

    return Relationship(
        subject=subject,
        subject_type=subject_type,
        relation=relation,
        object=obj,
        object_type=object_type
    )


def parse_attribute(data: Dict[str, Any]) -> Optional[Attribute]:
    """Parse attribute from dict.

    Supports both kebab-case and snake_case field names for compatibility.

    Args:
        data: Attribute dict with entity, entity-type, attribute, value

    Returns:
        Attribute object or None if invalid
    """
    if not isinstance(data, dict):
        logger.warning(f"Attribute data is not a dict: {type(data)}")
        return None

    entity = data.get('entity')
    entity_type = data.get('entity-type') or data.get('entity_type')
    attribute = data.get('attribute')
    value = data.get('value')

    if not all([entity, entity_type, attribute, value is not None]):
        logger.warning(f"Missing required fields in attribute: {data}")
        return None

    if not all(isinstance(v, str) for v in [entity, entity_type, attribute]):
        logger.warning(f"Attribute fields must be strings: {data}")
        return None

    # Value can be string, number, bool - convert to string
    if not isinstance(value, str):
        value = str(value)

    return Attribute(
        entity=entity,
        entity_type=entity_type,
        attribute=attribute,
        value=value
    )
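A concrete example of the JSON shape `parse_extraction_response` expects, using the field names the parsers above look for (the specific entities and values are made up for illustration):

```python
import json

# Hypothetical LLM response in the simplified format.
response = json.dumps({
    "entities": [
        {"entity": "Cornish pasty", "type": "fo/Recipe"},
        {"entity": "beef", "type": "fo/Food"},
    ],
    "relationships": [
        {"subject": "Cornish pasty", "subject-type": "fo/Recipe",
         "relation": "fo/has_ingredient",
         "object": "beef", "object-type": "fo/Food"},
    ],
    "attributes": [
        # Non-string values are accepted and coerced to strings by the parser.
        {"entity": "Cornish pasty", "entity-type": "fo/Recipe",
         "attribute": "fo/serves", "value": 4},
    ],
})

data = json.loads(response)
print(len(data["entities"]), len(data["relationships"]), len(data["attributes"]))
# → 2 1 1
```

Note the kebab-case keys (`subject-type`, `entity-type`); the parsers also accept the snake_case spellings.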
@@ -0,0 +1,228 @@
"""
Converts simplified extraction format to RDF triples.

Transforms entities, relationships, and attributes into proper RDF triples
with full URIs and correct is_uri flags.
"""

import logging
from typing import List, Optional

from .... schema import Triple, Value
from .... rdf import RDF_TYPE, RDF_LABEL

from .simplified_parser import Entity, Relationship, Attribute, ExtractionResult
from .entity_normalizer import EntityRegistry
from .ontology_selector import OntologySubset

logger = logging.getLogger(__name__)


class TripleConverter:
    """Converts extraction results to RDF triples."""

    def __init__(self, ontology_subset: OntologySubset, ontology_id: str):
        """Initialize converter.

        Args:
            ontology_subset: Ontology subset with classes and properties
            ontology_id: Ontology identifier for URI generation
        """
        self.ontology_subset = ontology_subset
        self.ontology_id = ontology_id
        self.entity_registry = EntityRegistry(ontology_id)

    def convert_all(self, extraction: ExtractionResult) -> List[Triple]:
        """Convert complete extraction result to RDF triples.

        Args:
            extraction: Parsed extraction with entities/relationships/attributes

        Returns:
            List of RDF Triple objects
        """
        triples = []

        # Convert entities (generates type + label triples)
        for entity in extraction.entities:
            entity_triples = self.convert_entity(entity)
            triples.extend(entity_triples)

        # Convert relationships
        for relationship in extraction.relationships:
            rel_triple = self.convert_relationship(relationship)
            if rel_triple:
                triples.append(rel_triple)

        # Convert attributes
        for attribute in extraction.attributes:
            attr_triple = self.convert_attribute(attribute)
            if attr_triple:
                triples.append(attr_triple)

        return triples

    def convert_entity(self, entity: Entity) -> List[Triple]:
        """Convert entity to RDF triples (type + label).

        Args:
            entity: Entity object with name and type

        Returns:
            List containing type triple and label triple
        """
        triples = []

        # Get or create URI for this entity
        entity_uri = self.entity_registry.get_or_create_uri(
            entity.entity,
            entity.type
        )

        # Look up class URI from ontology
        class_uri = self._get_class_uri(entity.type)
        if not class_uri:
            logger.warning(f"Unknown entity type '{entity.type}', skipping entity '{entity.entity}'")
            return triples

        # Generate type triple: entity rdf:type ClassURI
        type_triple = Triple(
            s=Value(value=entity_uri, is_uri=True),
            p=Value(value=RDF_TYPE, is_uri=True),
            o=Value(value=class_uri, is_uri=True)
        )
        triples.append(type_triple)

        # Generate label triple: entity rdfs:label "entity name"
        label_triple = Triple(
            s=Value(value=entity_uri, is_uri=True),
            p=Value(value=RDF_LABEL, is_uri=True),
            o=Value(value=entity.entity, is_uri=False)  # Literal!
        )
        triples.append(label_triple)

        return triples

    def convert_relationship(self, relationship: Relationship) -> Optional[Triple]:
        """Convert relationship to RDF triple.

        Args:
            relationship: Relationship with subject/object entities and relation

        Returns:
            Triple connecting two entity URIs via property URI, or None if invalid
        """
        # Get URIs for subject and object entities
        subject_uri = self.entity_registry.get_or_create_uri(
            relationship.subject,
            relationship.subject_type
        )

        object_uri = self.entity_registry.get_or_create_uri(
            relationship.object,
            relationship.object_type
        )

        # Look up property URI from ontology
        property_uri = self._get_object_property_uri(relationship.relation)
        if not property_uri:
            logger.warning(f"Unknown relationship '{relationship.relation}', skipping")
            return None

        # Generate triple: subject property object
        return Triple(
            s=Value(value=subject_uri, is_uri=True),
            p=Value(value=property_uri, is_uri=True),
            o=Value(value=object_uri, is_uri=True)
        )

    def convert_attribute(self, attribute: Attribute) -> Optional[Triple]:
        """Convert attribute to RDF triple.

        Args:
            attribute: Attribute with entity, attribute name, and literal value

        Returns:
            Triple with entity URI, property URI, and literal value, or None if invalid
        """
        # Get URI for entity
        entity_uri = self.entity_registry.get_or_create_uri(
            attribute.entity,
            attribute.entity_type
        )

        # Look up property URI from ontology
        property_uri = self._get_datatype_property_uri(attribute.attribute)
        if not property_uri:
            logger.warning(f"Unknown attribute '{attribute.attribute}', skipping")
            return None

        # Generate triple: entity property "literal value"
        return Triple(
            s=Value(value=entity_uri, is_uri=True),
            p=Value(value=property_uri, is_uri=True),
            o=Value(value=attribute.value, is_uri=False)  # Literal!
        )

    def _get_class_uri(self, class_id: str) -> Optional[str]:
        """Get full URI for ontology class.

        Args:
            class_id: Class identifier (e.g., "fo/Recipe")

        Returns:
            Full class URI or None if not found
        """
        if class_id not in self.ontology_subset.classes:
            return None

        class_def = self.ontology_subset.classes[class_id]

        # Extract URI from class definition
        if isinstance(class_def, dict) and 'uri' in class_def:
            return class_def['uri']

        # Fallback: construct URI
        return f"https://trustgraph.ai/ontology/{self.ontology_id}#{class_id}"

    def _get_object_property_uri(self, property_id: str) -> Optional[str]:
        """Get full URI for object property.

        Args:
            property_id: Property identifier (e.g., "fo/has_ingredient")

        Returns:
            Full property URI or None if not found
        """
        if property_id not in self.ontology_subset.object_properties:
            return None

        prop_def = self.ontology_subset.object_properties[property_id]

        # Extract URI from property definition
        if isinstance(prop_def, dict) and 'uri' in prop_def:
            return prop_def['uri']

        # Fallback: construct URI
        return f"https://trustgraph.ai/ontology/{self.ontology_id}#{property_id}"

    def _get_datatype_property_uri(self, property_id: str) -> Optional[str]:
        """Get full URI for datatype property.

        Args:
            property_id: Property identifier (e.g., "fo/serves")

        Returns:
            Full property URI or None if not found
        """
        if property_id not in self.ontology_subset.datatype_properties:
            return None

        prop_def = self.ontology_subset.datatype_properties[property_id]

        # Extract URI from property definition
        if isinstance(prop_def, dict) and 'uri' in prop_def:
            return prop_def['uri']

        # Fallback: construct URI
        return f"https://trustgraph.ai/ontology/{self.ontology_id}#{property_id}"
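A self-contained sketch of what `convert_entity` emits for a single entity, with dataclass stand-ins replacing the trustgraph `Triple`/`Value` schema types (the RDF_TYPE/RDF_LABEL constants are the standard `rdf:type`/`rdfs:label` URIs; the entity and class URIs echo the food.ontology example):

```python
from dataclasses import dataclass

@dataclass
class Value:
    value: str
    is_uri: bool

@dataclass
class Triple:
    s: Value
    p: Value
    o: Value

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

entity_uri = "https://trustgraph.ai/food/fo-recipe-cornish-pasty"
class_uri = "http://purl.org/ontology/fo/Recipe"

# convert_entity yields exactly two triples per known entity:
triples = [
    Triple(Value(entity_uri, True), Value(RDF_TYPE, True), Value(class_uri, True)),
    Triple(Value(entity_uri, True), Value(RDF_LABEL, True),
           Value("Cornish pasty", False)),  # label is a literal, not a URI
]
print(triples[1].o.value, triples[1].o.is_uri)
# → Cornish pasty False
```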