Feature/improve ontology extract (#576)

* Tech spec to change ontology extraction

* Ontology extract refactoring
cybermaggedon 2025-12-03 13:36:10 +00:00 committed by GitHub
parent 517434c075
commit b957004db9
6 changed files with 1496 additions and 19 deletions


@@ -0,0 +1,761 @@
# Ontology Knowledge Extraction - Phase 2 Refactor
**Status**: Draft
**Author**: Analysis Session 2025-12-03
**Related**: `ontology.md`, `ontorag.md`
## Overview
This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss.
## Current Implementation
### How It Works Now
1. **Ontology Loading** (`ontology_loader.py`)
- Loads ontology JSON with keys like `"fo/Recipe"`, `"fo/Food"`, `"fo/produces"`
- Class IDs include namespace prefix in the key itself
- Example from `food.ontology`:
```json
"classes": {
"fo/Recipe": {
"uri": "http://purl.org/ontology/fo/Recipe",
"rdfs:comment": "A Recipe is a combination..."
}
}
```
2. **Prompt Construction** (`extract.py:299-307`, `ontology-prompt.md`)
- Template receives `classes`, `object_properties`, `datatype_properties` dicts
- Template iterates: `{% for class_id, class_def in classes.items() %}`
- LLM sees: `**fo/Recipe**: A Recipe is a combination...`
- Example output format shows:
```json
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
{"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}
```
3. **Response Parsing** (`extract.py:382-428`)
- Expects JSON array: `[{"subject": "...", "predicate": "...", "object": "..."}]`
- Validates against ontology subset
- Expands URIs via `expand_uri()` (extract.py:473-521)
4. **URI Expansion** (`extract.py:473-521`)
- Checks if value is in `ontology_subset.classes` dict
- If found, extracts URI from class definition
- If not found, constructs URI: `f"https://trustgraph.ai/ontology/{ontology_id}#{value}"`
### Data Flow Example
**Ontology JSON → Loader → Prompt:**
```
"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**"
```
**LLM → Parser → Output:**
```
"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI
"fo/Recipe" → found in classes → uses original URI → PRESERVES URI
```
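This mismatch is easy to reproduce with a plain dict lookup. A minimal sketch (`expand_uri_sketch` is a hypothetical paraphrase of the lookup in `extract.py`, not the real function):

```python
# The classes dict is keyed by prefixed IDs, as in food.ontology.
classes = {
    "fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe"},
}

def expand_uri_sketch(value, ontology_id="food"):
    # Exact-match lookup only, as described above.
    if value in classes:
        return classes[value]["uri"]
    # Fallback constructs a new URI, discarding the original one.
    return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"

print(expand_uri_sketch("fo/Recipe"))  # preserves the original URI
print(expand_uri_sketch("Recipe"))     # constructs a fallback URI
```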
## Problems Identified
### 1. **Inconsistent Examples in Prompt**
**Issue**: The prompt template shows class IDs with prefixes (`fo/Recipe`) but the example output uses unprefixed class names (`Recipe`).
**Location**: `ontology-prompt.md:5-52`
```markdown
## Ontology Classes:
- **fo/Recipe**: A Recipe is...
## Example Output:
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}
```
**Impact**: LLM receives conflicting signals about what format to use.
### 2. **Information Loss in URI Expansion**
**Issue**: When LLM returns unprefixed class names following the example, `expand_uri()` can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs.
**Location**: `extract.py:494-500`
```python
if value in ontology_subset.classes: # Looks for "Recipe"
class_def = ontology_subset.classes[value] # But key is "fo/Recipe"
if isinstance(class_def, dict) and 'uri' in class_def:
return class_def['uri'] # Never reached!
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}" # Fallback
```
**Impact**:
- Original URI: `http://purl.org/ontology/fo/Recipe`
- Constructed URI: `https://trustgraph.ai/ontology/food#Recipe`
- Semantic meaning lost, breaks interoperability
### 3. **Ambiguous Entity Instance Format**
**Issue**: No clear guidance on entity instance URI format.
**Examples in prompt**:
- `"recipe:cornish-pasty"` (namespace-like prefix)
- `"ingredient:flour"` (different prefix)
**Actual behavior** (extract.py:517-520):
```python
# Treat as entity instance - construct unique URI
normalized = value.replace(" ", "-").lower()
return f"https://trustgraph.ai/{ontology_id}/{normalized}"
```
**Impact**: LLM must guess prefixing convention with no ontology context.
### 4. **No Namespace Prefix Guidance**
**Issue**: The ontology JSON contains namespace definitions (line 10-25 in food.ontology):
```json
"namespaces": {
"fo": "http://purl.org/ontology/fo/",
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
...
}
```
But these are never surfaced to the LLM. The LLM doesn't know:
- What "fo" means
- What prefix to use for entities
- Which namespace applies to which elements
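If the namespaces were surfaced, the loader could render them into a prompt section. A sketch (`render_namespace_section` is a hypothetical helper, assuming the `namespaces` dict is passed through unchanged):

```python
namespaces = {
    "fo": "http://purl.org/ontology/fo/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def render_namespace_section(namespaces):
    # Render the ontology's namespace map as a markdown prompt section.
    lines = ["## Namespace Prefixes:"]
    for prefix, uri in sorted(namespaces.items()):
        lines.append(f"- **{prefix}/**: {uri}")
    return "\n".join(lines)

print(render_namespace_section(namespaces))
```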
### 5. **Labels Not Used in Prompt**
**Issue**: Every class has `rdfs:label` fields (e.g., `{"value": "Recipe", "lang": "en-gb"}`), but the prompt template doesn't use them.
**Current**: Shows only `class_id` and `comment`
```jinja
- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %}
```
**Available but unused**:
```python
"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}]
```
**Impact**: Could provide human-readable names alongside technical IDs.
## Proposed Solutions
### Option A: Normalize to Unprefixed IDs
**Approach**: Strip prefixes from class IDs before showing to LLM.
**Changes**:
1. Modify `build_extraction_variables()` to transform keys:
```python
classes_for_prompt = {
k.split('/')[-1]: v # "fo/Recipe" → "Recipe"
for k, v in ontology_subset.classes.items()
}
```
2. Update prompt example to match (already uses unprefixed names)
3. Modify `expand_uri()` to handle both formats:
```python
# Try exact match first
if value in ontology_subset.classes:
    return ontology_subset.classes[value]['uri']
# Try with prefix
for prefix in ['fo/', 'rdf:', 'rdfs:']:
    prefixed = f"{prefix}{value}"
    if prefixed in ontology_subset.classes:
        return ontology_subset.classes[prefixed]['uri']
# Fall back to a constructed URI when nothing matches
return f"https://trustgraph.ai/ontology/{ontology_id}#{value}"
```
**Pros**:
- Cleaner, more human-readable
- Matches existing prompt examples
- LLMs work better with simpler tokens
**Cons**:
- Class name collisions if multiple ontologies have same class name
- Loses namespace information
- Requires fallback logic for lookups
### Option B: Use Full Prefixed IDs Consistently
**Approach**: Update examples to use prefixed IDs matching what's shown in the class list.
**Changes**:
1. Update prompt example (ontology-prompt.md:46-52):
```json
[
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
{"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
{"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
{"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
]
```
2. Add namespace explanation to prompt:
```markdown
## Namespace Prefixes:
- **fo/**: Food Ontology (http://purl.org/ontology/fo/)
- **rdf:**: RDF syntax namespace (http://www.w3.org/1999/02/22-rdf-syntax-ns#)
- **rdfs:**: RDF Schema (http://www.w3.org/2000/01/rdf-schema#)
Use these prefixes exactly as shown when referencing classes and properties.
```
3. Keep `expand_uri()` as-is (works correctly when matches found)
**Pros**:
- Input = Output consistency
- No information loss
- Preserves namespace semantics
- Works with multiple ontologies
**Cons**:
- More verbose tokens for LLM
- Requires LLM to track prefixes
### Option C: Hybrid - Show Both Label and ID
**Approach**: Enhance prompt to show both human-readable labels and technical IDs.
**Changes**:
1. Update prompt template:
```jinja
{% for class_id, class_def in classes.items() %}
- **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %}
{% endfor %}
```
Example output:
```markdown
- **fo/Recipe** (label: "Recipe"): A Recipe is a combination...
```
2. Update instructions:
```markdown
When referencing classes:
- Use the full prefixed ID (e.g., "fo/Recipe") in JSON output
- The label (e.g., "Recipe") is for human understanding only
```
**Pros**:
- Clearest for LLM
- Preserves all information
- Explicit about what to use
**Cons**:
- Longer prompt
- More complex template
## Implemented Approach
**Simplified Entity-Relationship-Attribute Format** - completely replaces the old triple-based format.
The new approach was chosen because:
1. **No Information Loss**: Original URIs preserved correctly
2. **Simpler Logic**: No transformation needed, direct dict lookups work
3. **Namespace Safety**: Handles multiple ontologies without collisions
4. **Semantic Correctness**: Maintains RDF/OWL semantics
## Implementation Complete
### What Was Built:
1. **New Prompt Template** (`prompts/ontology-extract-v2.txt`)
- ✅ Clear sections: Entity Types, Relationships, Attributes
- ✅ Example using full type identifiers (`fo/Recipe`, `fo/has_ingredient`)
- ✅ Instructions to use exact identifiers from schema
- ✅ New JSON format with entities/relationships/attributes arrays
2. **Entity Normalization** (`entity_normalizer.py`)
- ✅ `normalize_entity_name()` - Converts names to URI-safe format
- ✅ `normalize_type_identifier()` - Handles slashes in types (`fo/Recipe` → `fo-recipe`)
- ✅ `build_entity_uri()` - Creates unique URIs using (name, type) tuple
- ✅ `EntityRegistry` - Tracks entities for deduplication
3. **JSON Parser** (`simplified_parser.py`)
- ✅ Parses new format: `{entities: [...], relationships: [...], attributes: [...]}`
- ✅ Supports kebab-case and snake_case field names
- ✅ Returns structured dataclasses
- ✅ Graceful error handling with logging
4. **Triple Converter** (`triple_converter.py`)
- ✅ `convert_entity()` - Generates type + label triples automatically
- ✅ `convert_relationship()` - Connects entity URIs via properties
- ✅ `convert_attribute()` - Adds literal values
- ✅ Looks up full URIs from ontology definitions
5. **Updated Main Processor** (`extract.py`)
- ✅ Removed old triple-based extraction code
- ✅ Added `extract_with_simplified_format()` method
- ✅ Now exclusively uses new simplified format
- ✅ Calls prompt with `extract-with-ontologies-v2` ID
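Taken together, the pieces compose roughly like this. A sketch with simplified stand-ins for the helpers named above (the real module APIs may differ, and the real normalizers handle more characters):

```python
def normalize_type_identifier(type_id):
    # "fo/Recipe" -> "fo-recipe" (slashes become hyphens, lowercased)
    return type_id.lower().replace("/", "-")

def normalize_entity_name(name):
    # "Cornish pasty" -> "cornish-pasty"
    return "-".join(name.lower().split())

def build_entity_uri(name, entity_type, ontology_id,
                     base_uri="https://trustgraph.ai"):
    # Both name and type contribute to the ID, preventing collisions.
    entity_id = (f"{normalize_type_identifier(entity_type)}-"
                 f"{normalize_entity_name(name)}")
    return f"{base_uri}/{ontology_id}/{entity_id}"

print(build_entity_uri("Cornish pasty", "fo/Recipe", "food"))
# -> https://trustgraph.ai/food/fo-recipe-cornish-pasty
```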
## Test Cases
### Test 1: URI Preservation
```python
# Given ontology class
classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}}
# When LLM returns
llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"}
# Then expanded URI should be
assert expanded == "http://purl.org/ontology/fo/Recipe"
# Not: "https://trustgraph.ai/ontology/food#Recipe"
```
### Test 2: Multi-Ontology Collision
```python
# Given two ontologies
ont1 = {"fo/Recipe": {...}}
ont2 = {"cooking/Recipe": {...}}
# LLM should use full prefix to disambiguate
llm_output = {"object": "fo/Recipe"} # Not just "Recipe"
```
### Test 3: Entity Instance Format
```python
# Given prompt with food ontology
# LLM should create instances like
{"subject": "recipe:cornish-pasty"} # Namespace-style
{"subject": "food:beef"} # Consistent prefix
```
## Open Questions
1. **Should entity instances use namespace prefixes?**
- Current: `"recipe:cornish-pasty"` (arbitrary)
- Alternative: Use ontology prefix `"fo:cornish-pasty"`?
- Alternative: No prefix, expand in URI `"cornish-pasty"` → full URI?
2. **How to handle domain/range in prompt?**
- Currently shows: `(Recipe → Food)`
- Should it be: `(fo/Recipe → fo/Food)`?
3. **Should we validate domain/range constraints?**
- TODO comment at extract.py:470
- Would catch more errors but more complex
4. **What about inverse properties and equivalences?**
- Ontology has `owl:inverseOf`, `owl:equivalentClass`
- Not currently used in extraction
- Should they be?
## Success Metrics
- ✅ Zero URI information loss (100% preservation of original URIs)
- ✅ LLM output format matches input format
- ✅ No ambiguous examples in prompt
- ✅ Tests pass with multiple ontologies
- ✅ Improved extraction quality (measured by valid triple %)
## Alternative Approach: Simplified Extraction Format
### Philosophy
Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: **find entities and relationships in text**.
Let the code handle URI construction, RDF conversion, and semantic web formalities.
### Example: Entity Classification
**Input Text:**
```
Cornish pasty is a traditional British pastry filled with meat and vegetables.
```
**Ontology Schema (shown to LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
```
**What LLM Returns (Simple JSON):**
```json
{
"entities": [
{
"entity": "Cornish pasty",
"type": "Recipe"
}
]
}
```
**What Code Produces (RDF Triples):**
```python
# 1. Normalize entity name + type to ID (type prevents collisions)
entity_id = "recipe-cornish-pasty" # normalize("Cornish pasty", "Recipe")
entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty"
# Note: Same name, different type = different URI
# "Cornish pasty" (Recipe) → recipe-cornish-pasty
# "Cornish pasty" (Food) → food-cornish-pasty
# 2. Generate triples
triples = [
# Type triple
Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True),
o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
),
# Label triple (automatic)
Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True),
o=Value(value="Cornish pasty", is_uri=False)
)
]
```
### Benefits
1. **LLM doesn't need to:**
- Understand URI syntax
- Invent identifier prefixes (`recipe:`, `ingredient:`)
- Know about `rdf:type` or `rdfs:label`
- Construct semantic web identifiers
2. **LLM just needs to:**
- Find entities in text
- Map them to ontology classes
- Extract relationships and attributes
3. **Code handles:**
- URI normalization and construction
- RDF triple generation
- Automatic label assignment
- Namespace management
### Why This Works Better
- **Simpler prompt** = less confusion = fewer errors
- **Consistent IDs** = code controls normalization rules
- **Auto-generated labels** = no missing rdfs:label triples
- **LLM focuses on extraction** = what it's actually good at
### Example: Entity Relationships
**Input Text:**
```
Cornish pasty is a traditional British pastry filled with beef and potatoes.
```
**Ontology Schema (shown to LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
- Food: A food is something that can be eaten
- Ingredient: An ingredient combines a quantity and a food
## Relationships:
- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient)
- food: Relates an ingredient to the food that is required (Ingredient → Food)
```
**What LLM Returns (Simple JSON):**
```json
{
"entities": [
{
"entity": "Cornish pasty",
"type": "Recipe"
},
{
"entity": "beef",
"type": "Food"
},
{
"entity": "potatoes",
"type": "Food"
}
],
"relationships": [
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "beef",
"object-type": "Food"
},
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "potatoes",
"object-type": "Food"
}
]
}
```
**What Code Produces (RDF Triples):**
```python
# Normalize entity names to URIs
cornish_pasty_uri = "https://trustgraph.ai/food/cornish-pasty"
beef_uri = "https://trustgraph.ai/food/beef"
potatoes_uri = "https://trustgraph.ai/food/potatoes"
# Look up relation URI from ontology
has_ingredient_uri = "http://purl.org/ontology/fo/ingredients" # from fo/has_ingredient
triples = [
# Entity type triples (as before)
Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"),
Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"),
Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
Triple(s=beef_uri, p=rdfs_label, o="beef"),
Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"),
Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"),
# Relationship triples
Triple(
s=Value(value=cornish_pasty_uri, is_uri=True),
p=Value(value=has_ingredient_uri, is_uri=True),
o=Value(value=beef_uri, is_uri=True)
),
Triple(
s=Value(value=cornish_pasty_uri, is_uri=True),
p=Value(value=has_ingredient_uri, is_uri=True),
o=Value(value=potatoes_uri, is_uri=True)
)
]
```
**Key Points:**
- LLM returns natural language entity names: `"Cornish pasty"`, `"beef"`, `"potatoes"`
- LLM includes types to disambiguate: `subject-type`, `object-type`
- LLM uses relation name from schema: `"has_ingredient"`
- Code derives consistent IDs using (name, type): `("Cornish pasty", "Recipe")` → `recipe-cornish-pasty`
- Code looks up relation URI from ontology: `fo/has_ingredient` → full URI
- Same (name, type) tuple always gets same URI (deduplication)
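The relation lookup can be sketched as a scan over the ontology's object properties (assuming the subset is keyed by prefixed IDs with a `uri` field, as in `food.ontology`; `lookup_relation_uri` is a hypothetical helper):

```python
object_properties = {
    "fo/has_ingredient": {"uri": "http://purl.org/ontology/fo/ingredients"},
    "fo/produces": {"uri": "http://purl.org/ontology/fo/produces"},
}

def lookup_relation_uri(relation):
    # Exact match first, then match on the unprefixed local name,
    # since the LLM returns the relation name shown in the schema.
    if relation in object_properties:
        return object_properties[relation]["uri"]
    for prop_id, prop_def in object_properties.items():
        if prop_id.split("/")[-1] == relation:
            return prop_def["uri"]
    return None

print(lookup_relation_uri("has_ingredient"))
```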
### Example: Entity Name Disambiguation
**Problem:** Same name can refer to different entity types.
**Real-world case:**
```
"Cornish pasty" can be:
- A Recipe (instructions for making it)
- A Food (the dish itself)
```
**How It's Handled:**
LLM returns both as separate entities:
```json
{
"entities": [
{"entity": "Cornish pasty", "type": "Recipe"},
{"entity": "Cornish pasty", "type": "Food"}
],
"relationships": [
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "produces",
"object": "Cornish pasty",
"object-type": "Food"
}
]
}
```
**Code Resolution:**
```python
# Different types → different URIs
recipe_uri = normalize("Cornish pasty", "Recipe")
# → "https://trustgraph.ai/food/recipe-cornish-pasty"
food_uri = normalize("Cornish pasty", "Food")
# → "https://trustgraph.ai/food/food-cornish-pasty"
# Relationship connects them correctly
triple = Triple(
s=recipe_uri, # The Recipe
p="http://purl.org/ontology/fo/produces",
o=food_uri # The Food
)
```
**Why This Works:**
- Type is included in ALL references (entities, relationships, attributes)
- Code uses `(name, type)` tuple as lookup key
- No ambiguity, no collisions
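The (name, type)-keyed registry behavior can be sketched as follows (a simplified stand-in for `EntityRegistry`, with a toy slug rule):

```python
class EntityRegistrySketch:
    # Same (name, type) tuple always yields the same URI.
    def __init__(self, ontology_id, base="https://trustgraph.ai"):
        self.ontology_id = ontology_id
        self.base = base
        self._registry = {}

    def get_or_create_uri(self, name, etype):
        key = (name, etype)
        if key not in self._registry:
            slug = f"{etype.lower()}-{name.lower().replace(' ', '-')}"
            self._registry[key] = f"{self.base}/{self.ontology_id}/{slug}"
        return self._registry[key]

reg = EntityRegistrySketch("food")
recipe_uri = reg.get_or_create_uri("Cornish pasty", "Recipe")
food_uri = reg.get_or_create_uri("Cornish pasty", "Food")
# Same name, different types -> different URIs
print(recipe_uri != food_uri)
# Same (name, type) -> same URI (deduplication)
print(reg.get_or_create_uri("Cornish pasty", "Recipe") == recipe_uri)
```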
### Example: Entity Attributes
**Input Text:**
```
This Cornish pasty recipe serves 4-6 people and takes 45 minutes to prepare.
```
**Ontology Schema (shown to LLM):**
```markdown
## Entity Types:
- Recipe: A recipe is a combination of ingredients and a method
## Attributes:
- serves: Indicates what the recipe is intended to serve (Recipe → text)
- preparation_time: Time needed to prepare the recipe (Recipe → text)
```
**What LLM Returns (Simple JSON):**
```json
{
"entities": [
{
"entity": "Cornish pasty recipe",
"type": "Recipe"
}
],
"attributes": [
{
"entity": "Cornish pasty recipe",
"entity-type": "Recipe",
"attribute": "serves",
"value": "4-6 people"
},
{
"entity": "Cornish pasty recipe",
"entity-type": "Recipe",
"attribute": "preparation_time",
"value": "45 minutes"
}
]
}
```
**What Code Produces (RDF Triples):**
```python
# Normalize entity name to URI
recipe_uri = "https://trustgraph.ai/food/cornish-pasty-recipe"
# Look up attribute URIs from ontology
serves_uri = "http://purl.org/ontology/fo/serves" # from fo/serves
prep_time_uri = "http://purl.org/ontology/fo/preparation_time" # from fo/preparation_time
triples = [
# Entity type triple
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=rdf_type, is_uri=True),
o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True)
),
# Label triple (automatic)
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=rdfs_label, is_uri=True),
o=Value(value="Cornish pasty recipe", is_uri=False)
),
# Attribute triples (objects are literals, not URIs)
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=serves_uri, is_uri=True),
o=Value(value="4-6 people", is_uri=False) # Literal value!
),
Triple(
s=Value(value=recipe_uri, is_uri=True),
p=Value(value=prep_time_uri, is_uri=True),
o=Value(value="45 minutes", is_uri=False) # Literal value!
)
]
```
**Key Points:**
- LLM extracts literal values: `"4-6 people"`, `"45 minutes"`
- LLM includes entity type for disambiguation: `entity-type`
- LLM uses attribute name from schema: `"serves"`, `"preparation_time"`
- Code looks up attribute URI from ontology datatype properties
- **Object is literal** (`is_uri=False`), not a URI reference
- Values stay as natural text, no normalization needed
**Difference from Relationships:**
- Relationships: both subject and object are entities (URIs)
- Attributes: subject is entity (URI), object is literal value (string/number)
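The distinction can be sketched with a tiny Value/Triple model (hypothetical types mirroring the ones used in the examples above, not the production classes):

```python
from dataclasses import dataclass

@dataclass
class Value:
    value: str
    is_uri: bool

@dataclass
class Triple:
    s: Value
    p: Value
    o: Value

def relationship_triple(subj_uri, pred_uri, obj_uri):
    # Relationships: both ends are entity URIs.
    return Triple(Value(subj_uri, True), Value(pred_uri, True),
                  Value(obj_uri, True))

def attribute_triple(subj_uri, pred_uri, literal):
    # Attributes: the object is a literal value, not a URI.
    return Triple(Value(subj_uri, True), Value(pred_uri, True),
                  Value(literal, False))

t = attribute_triple("https://trustgraph.ai/food/cornish-pasty-recipe",
                     "http://purl.org/ontology/fo/serves", "4-6 people")
print(t.o.is_uri)  # False
```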
### Complete Example: Entities + Relationships + Attributes
**Input Text:**
```
Cornish pasty is a savory pastry filled with beef and potatoes.
This recipe serves 4 people.
```
**What LLM Returns:**
```json
{
"entities": [
{
"entity": "Cornish pasty",
"type": "Recipe"
},
{
"entity": "beef",
"type": "Food"
},
{
"entity": "potatoes",
"type": "Food"
}
],
"relationships": [
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "beef",
"object-type": "Food"
},
{
"subject": "Cornish pasty",
"subject-type": "Recipe",
"relation": "has_ingredient",
"object": "potatoes",
"object-type": "Food"
}
],
"attributes": [
{
"entity": "Cornish pasty",
"entity-type": "Recipe",
"attribute": "serves",
"value": "4 people"
}
]
}
```
**Result:** 9 RDF triples generated:
- 3 entity type triples (rdf:type)
- 3 entity label triples (rdfs:label) - automatic
- 2 relationship triples (has_ingredient)
- 1 attribute triple (serves)
All from simple, natural language extractions by the LLM!
## References
- Current implementation: `trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`
- Prompt template: `ontology-prompt.md`
- Test cases: `tests/unit/test_extract/test_ontology/`
- Example ontology: `e2e/test-data/food.ontology`

ontology-prompt.md Normal file

@@ -0,0 +1,54 @@
You are a knowledge extraction expert. Extract structured triples from text using ONLY the provided ontology elements.
## Ontology Classes:
{% for class_id, class_def in classes.items() %}
- **{{class_id}}**{% if class_def.subclass_of %} (subclass of {{class_def.subclass_of}}){% endif %}{% if class_def.comment %}: {{class_def.comment}}{% endif %}
{% endfor %}
## Object Properties (connect entities):
{% for prop_id, prop_def in object_properties.items() %}
- **{{prop_id}}**{% if prop_def.domain and prop_def.range %} ({{prop_def.domain}} → {{prop_def.range}}){% endif %}{% if prop_def.comment %}: {{prop_def.comment}}{% endif %}
{% endfor %}
## Datatype Properties (entity attributes):
{% for prop_id, prop_def in datatype_properties.items() %}
- **{{prop_id}}**{% if prop_def.domain and prop_def.range %} ({{prop_def.domain}} → {{prop_def.range}}){% endif %}{% if prop_def.comment %}: {{prop_def.comment}}{% endif %}
{% endfor %}
## Text to Analyze:
{{text}}
## Extraction Rules:
1. Only use classes defined above for entity types
2. Only use properties defined above for relationships and attributes
3. Respect domain and range constraints where specified
4. For class instances, use `rdf:type` as the predicate
5. Include `rdfs:label` for new entities to provide human-readable names
6. Extract all relevant triples that can be inferred from the text
7. Use entity URIs or meaningful identifiers as subjects/objects
## Output Format:
Return ONLY a valid JSON array (no markdown, no code blocks) containing objects with these fields:
- "subject": the subject entity (URI or identifier)
- "predicate": the property (from ontology or rdf:type/rdfs:label)
- "object": the object entity or literal value
Important: Return raw JSON only, with no markdown formatting, no code blocks, and no backticks.
## Example Output:
[
{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"},
{"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
{"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"},
{"subject": "ingredient:flour", "predicate": "rdf:type", "object": "Ingredient"},
{"subject": "ingredient:flour", "predicate": "rdfs:label", "object": "plain flour"}
]
Now extract triples from the text above.

entity_normalizer.py Normal file

@@ -0,0 +1,164 @@
"""
Entity URI normalization for ontology-based knowledge extraction.
Converts entity names and types into consistent, collision-free URIs.
"""
import re
from typing import Tuple
def normalize_entity_name(entity_name: str) -> str:
    """Normalize entity name to URI-safe identifier.

    Args:
        entity_name: Natural language entity name (e.g., "Cornish pasty")

    Returns:
        Normalized identifier (e.g., "cornish-pasty")
    """
    # Convert to lowercase
    normalized = entity_name.lower()
    # Replace spaces and underscores with hyphens
    normalized = re.sub(r'[\s_]+', '-', normalized)
    # Remove any characters that aren't alphanumeric, hyphens, or periods
    normalized = re.sub(r'[^a-z0-9\-.]', '', normalized)
    # Remove leading/trailing hyphens
    normalized = normalized.strip('-')
    # Collapse multiple hyphens
    normalized = re.sub(r'-+', '-', normalized)
    return normalized


def normalize_type_identifier(type_id: str) -> str:
    """Normalize ontology type identifier to URI-safe format.

    Handles prefixed types like "fo/Recipe" by converting to "fo-recipe".

    Args:
        type_id: Ontology type identifier (e.g., "fo/Recipe", "Food")

    Returns:
        Normalized type identifier (e.g., "fo-recipe", "food")
    """
    # Convert to lowercase
    normalized = type_id.lower()
    # Replace slashes, colons, and spaces with hyphens
    normalized = re.sub(r'[/:.\s_]+', '-', normalized)
    # Remove any remaining non-alphanumeric characters except hyphens
    normalized = re.sub(r'[^a-z0-9\-]', '', normalized)
    # Remove leading/trailing hyphens
    normalized = normalized.strip('-')
    # Collapse multiple hyphens
    normalized = re.sub(r'-+', '-', normalized)
    return normalized


def build_entity_uri(entity_name: str, entity_type: str, ontology_id: str,
                     base_uri: str = "https://trustgraph.ai") -> str:
    """Build a unique URI for an entity based on its name and type.

    The type is included in the URI to prevent collisions when the same
    name refers to different entity types (e.g., "Cornish pasty" as both
    Recipe and Food).

    Args:
        entity_name: Natural language entity name (e.g., "Cornish pasty")
        entity_type: Ontology type (e.g., "fo/Recipe")
        ontology_id: Ontology identifier (e.g., "food")
        base_uri: Base URI for entity URIs (default: "https://trustgraph.ai")

    Returns:
        Full entity URI (e.g., "https://trustgraph.ai/food/fo-recipe-cornish-pasty")

    Examples:
        >>> build_entity_uri("Cornish pasty", "fo/Recipe", "food")
        'https://trustgraph.ai/food/fo-recipe-cornish-pasty'
        >>> build_entity_uri("Cornish pasty", "fo/Food", "food")
        'https://trustgraph.ai/food/fo-food-cornish-pasty'
        >>> build_entity_uri("beef", "fo/Food", "food")
        'https://trustgraph.ai/food/fo-food-beef'
    """
    type_part = normalize_type_identifier(entity_type)
    name_part = normalize_entity_name(entity_name)
    # Combine type and name to ensure uniqueness
    entity_id = f"{type_part}-{name_part}"
    # Build full URI
    return f"{base_uri}/{ontology_id}/{entity_id}"


class EntityRegistry:
    """Registry to track entity name/type tuples and their assigned URIs.

    Ensures that the same (entity_name, entity_type) tuple always maps
    to the same URI, enabling deduplication across the extraction process.
    """

    def __init__(self, ontology_id: str, base_uri: str = "https://trustgraph.ai"):
        """Initialize the entity registry.

        Args:
            ontology_id: Ontology identifier (e.g., "food")
            base_uri: Base URI for entity URIs
        """
        self.ontology_id = ontology_id
        self.base_uri = base_uri
        self._registry = {}  # (entity_name, entity_type) -> uri

    def get_or_create_uri(self, entity_name: str, entity_type: str) -> str:
        """Get existing URI or create new one for entity.

        Args:
            entity_name: Natural language entity name
            entity_type: Ontology type identifier

        Returns:
            URI for this entity (same URI for same name/type tuple)
        """
        key = (entity_name, entity_type)
        if key not in self._registry:
            uri = build_entity_uri(
                entity_name,
                entity_type,
                self.ontology_id,
                self.base_uri
            )
            self._registry[key] = uri
        return self._registry[key]

    def lookup(self, entity_name: str, entity_type: str) -> str:
        """Look up URI for entity (returns None if not registered).

        Args:
            entity_name: Natural language entity name
            entity_type: Ontology type identifier

        Returns:
            URI for this entity, or None if not found
        """
        key = (entity_name, entity_type)
        return self._registry.get(key)

    def clear(self):
        """Clear all registered entities."""
        self._registry.clear()

    def size(self) -> int:
        """Get number of registered entities."""
        return len(self._registry)

extract.py

@@ -20,6 +20,8 @@ from .ontology_embedder import OntologyEmbedder
 from .vector_store import InMemoryVectorStore
 from .text_processor import TextProcessor
 from .ontology_selector import OntologySelector, OntologySubset
+from .simplified_parser import parse_extraction_response
+from .triple_converter import TripleConverter

 logger = logging.getLogger(__name__)
@@ -298,25 +300,10 @@ class Processor(FlowProcessor):
         # Build extraction prompt variables
         prompt_variables = self.build_extraction_variables(chunk, ontology_subset)

-        # Call prompt service for extraction
-        try:
-            # Use prompt() method with extract-with-ontologies prompt ID
-            triples_response = await flow("prompt-request").prompt(
-                id="extract-with-ontologies",
-                variables=prompt_variables
-            )
-            logger.debug(f"Extraction response: {triples_response}")
-            if not isinstance(triples_response, list):
-                logger.error("Expected list of triples from prompt service")
-                triples_response = []
-        except Exception as e:
-            logger.error(f"Prompt service error: {e}", exc_info=True)
-            triples_response = []
-
-        # Parse and validate triples
-        triples = self.parse_and_validate_triples(triples_response, ontology_subset)
+        # Extract using simplified entity-relationship-attribute format
+        triples = await self.extract_with_simplified_format(
+            flow, chunk, ontology_subset, prompt_variables
+        )

         # Add metadata triples
         for t in v.metadata.metadata:
@@ -362,6 +349,55 @@ class Processor(FlowProcessor):
             []
         )
    async def extract_with_simplified_format(
        self,
        flow,
        chunk: str,
        ontology_subset: OntologySubset,
        prompt_variables: Dict[str, Any]
    ) -> List[Triple]:
        """Extract triples using simplified entity-relationship-attribute format.

        Args:
            flow: Flow object for accessing services
            chunk: Text chunk to extract from
            ontology_subset: Selected ontology subset
            prompt_variables: Variables for prompt template

        Returns:
            List of Triple objects
        """
        try:
            # Call prompt service with simplified format prompt
            extraction_response = await flow("prompt-request").prompt(
                id="extract-with-ontologies",
                variables=prompt_variables
            )
            logger.debug(f"Simplified extraction response: {extraction_response}")

            # Parse response into structured format
            extraction_result = parse_extraction_response(extraction_response)
            if not extraction_result:
                logger.warning("Failed to parse extraction response")
                return []

            logger.info(f"Parsed {len(extraction_result.entities)} entities, "
                        f"{len(extraction_result.relationships)} relationships, "
                        f"{len(extraction_result.attributes)} attributes")

            # Convert to RDF triples
            converter = TripleConverter(ontology_subset, ontology_subset.ontology_id)
            triples = converter.convert_all(extraction_result)

            logger.info(f"Generated {len(triples)} RDF triples from simplified extraction")
            return triples

        except Exception as e:
            logger.error(f"Simplified extraction error: {e}", exc_info=True)
            return []
    def build_extraction_variables(self, chunk: str, ontology_subset: OntologySubset) -> Dict[str, Any]:
        """Build variables for ontology-based extraction prompt template.

simplified_parser.py Normal file

@@ -0,0 +1,234 @@
"""
Parser for simplified ontology extraction JSON format.
Parses the new entity-relationship-attribute format from LLM responses.
"""
import json
import logging
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
logger = logging.getLogger(__name__)
@dataclass
class Entity:
    """Represents an extracted entity."""
    entity: str
    type: str


@dataclass
class Relationship:
    """Represents an extracted relationship."""
    subject: str
    subject_type: str
    relation: str
    object: str
    object_type: str


@dataclass
class Attribute:
    """Represents an extracted attribute."""
    entity: str
    entity_type: str
    attribute: str
    value: str


@dataclass
class ExtractionResult:
    """Complete extraction result."""
    entities: List[Entity]
    relationships: List[Relationship]
    attributes: List[Attribute]


def parse_extraction_response(response: Any) -> Optional[ExtractionResult]:
    """Parse LLM extraction response into structured format.

    Args:
        response: LLM response (string JSON or already parsed dict)

    Returns:
        ExtractionResult with parsed entities/relationships/attributes,
        or None if parsing fails
    """
    # Handle string response (parse JSON)
    if isinstance(response, str):
        try:
            data = json.loads(response)
        except json.JSONDecodeError as e:
            logger.error(f"Failed to parse JSON response: {e}")
            logger.debug(f"Response was: {response[:500]}")
            return None
    elif isinstance(response, dict):
        data = response
    else:
        logger.error(f"Unexpected response type: {type(response)}")
        return None

    # Validate structure
    if not isinstance(data, dict):
        logger.error(f"Expected dict, got {type(data)}")
        return None

    # Parse entities
    entities = []
    entities_data = data.get('entities', [])
    if not isinstance(entities_data, list):
        logger.warning(f"'entities' is not a list: {type(entities_data)}")
        entities_data = []
    for entity_data in entities_data:
        try:
            entity = parse_entity(entity_data)
            if entity:
                entities.append(entity)
        except Exception as e:
            logger.warning(f"Failed to parse entity {entity_data}: {e}")

    # Parse relationships
    relationships = []
    relationships_data = data.get('relationships', [])
    if not isinstance(relationships_data, list):
        logger.warning(f"'relationships' is not a list: {type(relationships_data)}")
        relationships_data = []
    for rel_data in relationships_data:
        try:
            relationship = parse_relationship(rel_data)
            if relationship:
relationships.append(relationship)
except Exception as e:
logger.warning(f"Failed to parse relationship {rel_data}: {e}")
# Parse attributes
attributes = []
attributes_data = data.get('attributes', [])
if not isinstance(attributes_data, list):
logger.warning(f"'attributes' is not a list: {type(attributes_data)}")
attributes_data = []
for attr_data in attributes_data:
try:
attribute = parse_attribute(attr_data)
if attribute:
attributes.append(attribute)
except Exception as e:
logger.warning(f"Failed to parse attribute {attr_data}: {e}")
return ExtractionResult(
entities=entities,
relationships=relationships,
attributes=attributes
)
def parse_entity(data: Dict[str, Any]) -> Optional[Entity]:
"""Parse entity from dict.
Args:
data: Entity dict with 'entity' and 'type' fields
Returns:
Entity object or None if invalid
"""
if not isinstance(data, dict):
logger.warning(f"Entity data is not a dict: {type(data)}")
return None
entity = data.get('entity')
entity_type = data.get('type')
if not entity or not entity_type:
logger.warning(f"Missing required fields in entity: {data}")
return None
if not isinstance(entity, str) or not isinstance(entity_type, str):
logger.warning(f"Entity fields must be strings: {data}")
return None
return Entity(entity=entity, type=entity_type)
def parse_relationship(data: Dict[str, Any]) -> Optional[Relationship]:
"""Parse relationship from dict.
Supports both kebab-case and snake_case field names for compatibility.
Args:
data: Relationship dict with subject, subject-type, relation, object, object-type
Returns:
Relationship object or None if invalid
"""
if not isinstance(data, dict):
logger.warning(f"Relationship data is not a dict: {type(data)}")
return None
subject = data.get('subject')
subject_type = data.get('subject-type') or data.get('subject_type')
relation = data.get('relation')
obj = data.get('object')
object_type = data.get('object-type') or data.get('object_type')
if not all([subject, subject_type, relation, obj, object_type]):
logger.warning(f"Missing required fields in relationship: {data}")
return None
if not all(isinstance(v, str) for v in [subject, subject_type, relation, obj, object_type]):
logger.warning(f"Relationship fields must be strings: {data}")
return None
return Relationship(
subject=subject,
subject_type=subject_type,
relation=relation,
object=obj,
object_type=object_type
)
def parse_attribute(data: Dict[str, Any]) -> Optional[Attribute]:
"""Parse attribute from dict.
Supports both kebab-case and snake_case field names for compatibility.
Args:
data: Attribute dict with entity, entity-type, attribute, value
Returns:
Attribute object or None if invalid
"""
if not isinstance(data, dict):
logger.warning(f"Attribute data is not a dict: {type(data)}")
return None
entity = data.get('entity')
entity_type = data.get('entity-type') or data.get('entity_type')
attribute = data.get('attribute')
value = data.get('value')
if not entity or not entity_type or not attribute or value is None:
logger.warning(f"Missing required fields in attribute: {data}")
return None
if not all(isinstance(v, str) for v in [entity, entity_type, attribute]):
logger.warning(f"Attribute fields must be strings: {data}")
return None
# Value can be string, number, bool - convert to string
if not isinstance(value, str):
value = str(value)
return Attribute(
entity=entity,
entity_type=entity_type,
attribute=attribute,
value=value
)
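For reference, a minimal sketch of the JSON shape the parsers above accept. The field names match the parser code (hyphenated keys, with snake_case variants also accepted); the entity and type values are illustrative, not taken from the patch.

```python
import json

# Illustrative sample of the simplified extraction format; the class and
# property ids (fo/Recipe, fo/has_ingredient, ...) are stand-in examples.
sample = json.loads("""
{
  "entities": [
    {"entity": "Cornish Pasty", "type": "fo/Recipe"},
    {"entity": "flour", "type": "fo/Food"}
  ],
  "relationships": [
    {"subject": "Cornish Pasty", "subject-type": "fo/Recipe",
     "relation": "fo/has_ingredient",
     "object": "flour", "object-type": "fo/Food"}
  ],
  "attributes": [
    {"entity": "Cornish Pasty", "entity-type": "fo/Recipe",
     "attribute": "fo/serves", "value": 4}
  ]
}
""")

# The three top-level keys parse_extraction_response reads; any that are
# missing simply default to empty lists.
assert set(sample) == {"entities", "relationships", "attributes"}
# Attribute values may arrive as numbers; parse_attribute stringifies them.
assert isinstance(sample["attributes"][0]["value"], int)
```

A response missing one of the three keys still parses; only malformed individual items are dropped, with a warning logged for each.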

"""
Converts simplified extraction format to RDF triples.
Transforms entities, relationships, and attributes into proper RDF triples
with full URIs and correct is_uri flags.
"""
import logging
from typing import List, Optional
from ....schema import Triple, Value
from ....rdf import RDF_TYPE, RDF_LABEL
from .simplified_parser import Entity, Relationship, Attribute, ExtractionResult
from .entity_normalizer import EntityRegistry
from .ontology_selector import OntologySubset
logger = logging.getLogger(__name__)
class TripleConverter:
"""Converts extraction results to RDF triples."""
def __init__(self, ontology_subset: OntologySubset, ontology_id: str):
"""Initialize converter.
Args:
ontology_subset: Ontology subset with classes and properties
ontology_id: Ontology identifier for URI generation
"""
self.ontology_subset = ontology_subset
self.ontology_id = ontology_id
self.entity_registry = EntityRegistry(ontology_id)
def convert_all(self, extraction: ExtractionResult) -> List[Triple]:
"""Convert complete extraction result to RDF triples.
Args:
extraction: Parsed extraction with entities/relationships/attributes
Returns:
List of RDF Triple objects
"""
triples = []
# Convert entities (generates type + label triples)
for entity in extraction.entities:
entity_triples = self.convert_entity(entity)
triples.extend(entity_triples)
# Convert relationships
for relationship in extraction.relationships:
rel_triple = self.convert_relationship(relationship)
if rel_triple:
triples.append(rel_triple)
# Convert attributes
for attribute in extraction.attributes:
attr_triple = self.convert_attribute(attribute)
if attr_triple:
triples.append(attr_triple)
return triples
def convert_entity(self, entity: Entity) -> List[Triple]:
"""Convert entity to RDF triples (type + label).
Args:
entity: Entity object with name and type
Returns:
List containing type triple and label triple
"""
triples = []
# Get or create URI for this entity
entity_uri = self.entity_registry.get_or_create_uri(
entity.entity,
entity.type
)
# Look up class URI from ontology
class_uri = self._get_class_uri(entity.type)
if not class_uri:
logger.warning(f"Unknown entity type '{entity.type}', skipping entity '{entity.entity}'")
return triples
# Generate type triple: entity rdf:type ClassURI
type_triple = Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value=RDF_TYPE, is_uri=True),
o=Value(value=class_uri, is_uri=True)
)
triples.append(type_triple)
# Generate label triple: entity rdfs:label "entity name"
label_triple = Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value=RDF_LABEL, is_uri=True),
o=Value(value=entity.entity, is_uri=False) # Literal!
)
triples.append(label_triple)
return triples
def convert_relationship(self, relationship: Relationship) -> Optional[Triple]:
"""Convert relationship to RDF triple.
Args:
relationship: Relationship with subject/object entities and relation
Returns:
Triple connecting two entity URIs via property URI, or None if invalid
"""
# Get URIs for subject and object entities
subject_uri = self.entity_registry.get_or_create_uri(
relationship.subject,
relationship.subject_type
)
object_uri = self.entity_registry.get_or_create_uri(
relationship.object,
relationship.object_type
)
# Look up property URI from ontology
property_uri = self._get_object_property_uri(relationship.relation)
if not property_uri:
logger.warning(f"Unknown relationship '{relationship.relation}', skipping")
return None
# Generate triple: subject property object
return Triple(
s=Value(value=subject_uri, is_uri=True),
p=Value(value=property_uri, is_uri=True),
o=Value(value=object_uri, is_uri=True)
)
def convert_attribute(self, attribute: Attribute) -> Optional[Triple]:
"""Convert attribute to RDF triple.
Args:
attribute: Attribute with entity, attribute name, and literal value
Returns:
Triple with entity URI, property URI, and literal value, or None if invalid
"""
# Get URI for entity
entity_uri = self.entity_registry.get_or_create_uri(
attribute.entity,
attribute.entity_type
)
# Look up property URI from ontology
property_uri = self._get_datatype_property_uri(attribute.attribute)
if not property_uri:
logger.warning(f"Unknown attribute '{attribute.attribute}', skipping")
return None
# Generate triple: entity property "literal value"
return Triple(
s=Value(value=entity_uri, is_uri=True),
p=Value(value=property_uri, is_uri=True),
o=Value(value=attribute.value, is_uri=False) # Literal!
)
def _get_class_uri(self, class_id: str) -> Optional[str]:
"""Get full URI for ontology class.
Args:
class_id: Class identifier (e.g., "fo/Recipe")
Returns:
Full class URI or None if not found
"""
if class_id not in self.ontology_subset.classes:
return None
class_def = self.ontology_subset.classes[class_id]
# Extract URI from class definition
if isinstance(class_def, dict) and 'uri' in class_def:
return class_def['uri']
# Fallback: construct URI
return f"https://trustgraph.ai/ontology/{self.ontology_id}#{class_id}"
def _get_object_property_uri(self, property_id: str) -> Optional[str]:
"""Get full URI for object property.
Args:
property_id: Property identifier (e.g., "fo/has_ingredient")
Returns:
Full property URI or None if not found
"""
if property_id not in self.ontology_subset.object_properties:
return None
prop_def = self.ontology_subset.object_properties[property_id]
# Extract URI from property definition
if isinstance(prop_def, dict) and 'uri' in prop_def:
return prop_def['uri']
# Fallback: construct URI
return f"https://trustgraph.ai/ontology/{self.ontology_id}#{property_id}"
def _get_datatype_property_uri(self, property_id: str) -> Optional[str]:
"""Get full URI for datatype property.
Args:
property_id: Property identifier (e.g., "fo/serves")
Returns:
Full property URI or None if not found
"""
if property_id not in self.ontology_subset.datatype_properties:
return None
prop_def = self.ontology_subset.datatype_properties[property_id]
# Extract URI from property definition
if isinstance(prop_def, dict) and 'uri' in prop_def:
return prop_def['uri']
# Fallback: construct URI
return f"https://trustgraph.ai/ontology/{self.ontology_id}#{property_id}"
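As a rough illustration of what convert_entity emits, here is a standalone sketch. The Value/Triple stand-ins, the entity URI, and the RDF constant values are assumptions for the example (the real schema classes and EntityRegistry URI scheme live elsewhere in the codebase); only the type-plus-label pattern and the is_uri flags mirror the converter above.

```python
from typing import NamedTuple

# Stand-ins for the real schema.Triple / schema.Value classes.
class Value(NamedTuple):
    value: str
    is_uri: bool

class Triple(NamedTuple):
    s: Value
    p: Value
    o: Value

# Standard RDF/RDFS predicate URIs (assumed to match RDF_TYPE / RDF_LABEL).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical values standing in for the EntityRegistry and ontology lookups.
entity_uri = "https://trustgraph.ai/e/food/recipe/cornish-pasty"
class_uri = "http://purl.org/ontology/fo/Recipe"

triples = [
    # entity rdf:type ClassURI -- all three positions are URIs
    Triple(Value(entity_uri, True), Value(RDF_TYPE, True),
           Value(class_uri, True)),
    # entity rdfs:label "name" -- object is a literal, is_uri=False
    Triple(Value(entity_uri, True), Value(RDF_LABEL, True),
           Value("Cornish Pasty", False)),
]

assert all(t.s.is_uri and t.p.is_uri for t in triples)
assert triples[1].o.is_uri is False
```

Relationships and attributes follow the same pattern: relationships produce a single all-URI triple, while attributes, like the label triple here, carry a literal object.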