diff --git a/docs/tech-specs/ontology-extract-phase-2.md b/docs/tech-specs/ontology-extract-phase-2.md new file mode 100644 index 00000000..ac1a0543 --- /dev/null +++ b/docs/tech-specs/ontology-extract-phase-2.md @@ -0,0 +1,761 @@ +# Ontology Knowledge Extraction - Phase 2 Refactor + +**Status**: Draft +**Author**: Analysis Session 2025-12-03 +**Related**: `ontology.md`, `ontorag.md` + +## Overview + +This document identifies inconsistencies in the current ontology-based knowledge extraction system and proposes a refactor to improve LLM performance and reduce information loss. + +## Current Implementation + +### How It Works Now + +1. **Ontology Loading** (`ontology_loader.py`) + - Loads ontology JSON with keys like `"fo/Recipe"`, `"fo/Food"`, `"fo/produces"` + - Class IDs include namespace prefix in the key itself + - Example from `food.ontology`: + ```json + "classes": { + "fo/Recipe": { + "uri": "http://purl.org/ontology/fo/Recipe", + "rdfs:comment": "A Recipe is a combination..." + } + } + ``` + +2. **Prompt Construction** (`extract.py:299-307`, `ontology-prompt.md`) + - Template receives `classes`, `object_properties`, `datatype_properties` dicts + - Template iterates: `{% for class_id, class_def in classes.items() %}` + - LLM sees: `**fo/Recipe**: A Recipe is a combination...` + - Example output format shows: + ```json + {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"} + {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"} + ``` + +3. **Response Parsing** (`extract.py:382-428`) + - Expects JSON array: `[{"subject": "...", "predicate": "...", "object": "..."}]` + - Validates against ontology subset + - Expands URIs via `expand_uri()` (extract.py:473-521) + +4. 
**URI Expansion** (`extract.py:473-521`) + - Checks if value is in `ontology_subset.classes` dict + - If found, extracts URI from class definition + - If not found, constructs URI: `f"https://trustgraph.ai/ontology/{ontology_id}#{value}"` + +### Data Flow Example + +**Ontology JSON → Loader → Prompt:** +``` +"fo/Recipe" → classes["fo/Recipe"] → LLM sees "**fo/Recipe**" +``` + +**LLM → Parser → Output:** +``` +"Recipe" → not in classes["fo/Recipe"] → constructs URI → LOSES original URI +"fo/Recipe" → found in classes → uses original URI → PRESERVES URI +``` + +## Problems Identified + +### 1. **Inconsistent Examples in Prompt** + +**Issue**: The prompt template shows class IDs with prefixes (`fo/Recipe`) but the example output uses unprefixed class names (`Recipe`). + +**Location**: `ontology-prompt.md:5-52` + +```markdown +## Ontology Classes: +- **fo/Recipe**: A Recipe is... + +## Example Output: +{"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"} +``` + +**Impact**: LLM receives conflicting signals about what format to use. + +### 2. **Information Loss in URI Expansion** + +**Issue**: When LLM returns unprefixed class names following the example, `expand_uri()` can't find them in the ontology dict and constructs fallback URIs, losing the original proper URIs. + +**Location**: `extract.py:494-500` + +```python +if value in ontology_subset.classes: # Looks for "Recipe" + class_def = ontology_subset.classes[value] # But key is "fo/Recipe" + if isinstance(class_def, dict) and 'uri' in class_def: + return class_def['uri'] # Never reached! +return f"https://trustgraph.ai/ontology/{ontology_id}#{value}" # Fallback +``` + +**Impact**: +- Original URI: `http://purl.org/ontology/fo/Recipe` +- Constructed URI: `https://trustgraph.ai/ontology/food#Recipe` +- Semantic meaning lost, breaks interoperability + +### 3. **Ambiguous Entity Instance Format** + +**Issue**: No clear guidance on entity instance URI format. 
+ +**Examples in prompt**: +- `"recipe:cornish-pasty"` (namespace-like prefix) +- `"ingredient:flour"` (different prefix) + +**Actual behavior** (extract.py:517-520): +```python +# Treat as entity instance - construct unique URI +normalized = value.replace(" ", "-").lower() +return f"https://trustgraph.ai/{ontology_id}/{normalized}" +``` + +**Impact**: LLM must guess prefixing convention with no ontology context. + +### 4. **No Namespace Prefix Guidance** + +**Issue**: The ontology JSON contains namespace definitions (line 10-25 in food.ontology): +```json +"namespaces": { + "fo": "http://purl.org/ontology/fo/", + "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#", + ... +} +``` + +But these are never surfaced to the LLM. The LLM doesn't know: +- What "fo" means +- What prefix to use for entities +- Which namespace applies to which elements + +### 5. **Labels Not Used in Prompt** + +**Issue**: Every class has `rdfs:label` fields (e.g., `{"value": "Recipe", "lang": "en-gb"}`), but the prompt template doesn't use them. + +**Current**: Shows only `class_id` and `comment` +```jinja +- **{{class_id}}**{% if class_def.comment %}: {{class_def.comment}}{% endif %} +``` + +**Available but unused**: +```python +"rdfs:label": [{"value": "Recipe", "lang": "en-gb"}] +``` + +**Impact**: Could provide human-readable names alongside technical IDs. + +## Proposed Solutions + +### Option A: Normalize to Unprefixed IDs + +**Approach**: Strip prefixes from class IDs before showing to LLM. + +**Changes**: +1. Modify `build_extraction_variables()` to transform keys: + ```python + classes_for_prompt = { + k.split('/')[-1]: v # "fo/Recipe" → "Recipe" + for k, v in ontology_subset.classes.items() + } + ``` + +2. Update prompt example to match (already uses unprefixed names) + +3. 
Modify `expand_uri()` to handle both formats:
+   ```python
+   # Try exact match first
+   if value in ontology_subset.classes:
+       return ontology_subset.classes[value]['uri']
+
+   # Try with prefix
+   for prefix in ['fo/', 'rdf:', 'rdfs:']:
+       prefixed = f"{prefix}{value}"
+       if prefixed in ontology_subset.classes:
+           return ontology_subset.classes[prefixed]['uri']
+   ```
+
+**Pros**:
+- Cleaner, more human-readable
+- Matches existing prompt examples
+- LLMs work better with simpler tokens
+
+**Cons**:
+- Class name collisions if multiple ontologies have same class name
+- Loses namespace information
+- Requires fallback logic for lookups
+
+### Option B: Use Full Prefixed IDs Consistently
+
+**Approach**: Update examples to use prefixed IDs matching what's shown in the class list.
+
+**Changes**:
+1. Update prompt example (ontology-prompt.md:46-52):
+   ```json
+   [
+     {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "fo/Recipe"},
+     {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"},
+     {"subject": "recipe:cornish-pasty", "predicate": "fo/produces", "object": "food:cornish-pasty"},
+     {"subject": "food:cornish-pasty", "predicate": "rdf:type", "object": "fo/Food"}
+   ]
+   ```
+
+2. Add namespace explanation to prompt:
+   ```markdown
+   ## Namespace Prefixes:
+   - **fo/**: Food Ontology (http://purl.org/ontology/fo/)
+   - **rdf:**: RDF (http://www.w3.org/1999/02/22-rdf-syntax-ns#)
+   - **rdfs:**: RDF Schema (http://www.w3.org/2000/01/rdf-schema#)
+
+   Use these prefixes exactly as shown when referencing classes and properties.
+   ```
+
+3. Keep `expand_uri()` as-is (works correctly when matches found)
+
+**Pros**:
+- Input = Output consistency
+- No information loss
+- Preserves namespace semantics
+- Works with multiple ontologies
+
+**Cons**:
+- More verbose tokens for LLM
+- Requires LLM to track prefixes
+
+### Option C: Hybrid - Show Both Label and ID
+
+**Approach**: Enhance prompt to show both human-readable labels and technical IDs.
+
+**Changes**:
+1. 
Update prompt template: + ```jinja + {% for class_id, class_def in classes.items() %} + - **{{class_id}}** (label: "{{class_def.labels[0].value if class_def.labels else class_id}}"){% if class_def.comment %}: {{class_def.comment}}{% endif %} + {% endfor %} + ``` + + Example output: + ```markdown + - **fo/Recipe** (label: "Recipe"): A Recipe is a combination... + ``` + +2. Update instructions: + ```markdown + When referencing classes: + - Use the full prefixed ID (e.g., "fo/Recipe") in JSON output + - The label (e.g., "Recipe") is for human understanding only + ``` + +**Pros**: +- Clearest for LLM +- Preserves all information +- Explicit about what to use + +**Cons**: +- Longer prompt +- More complex template + +## Implemented Approach + +**Simplified Entity-Relationship-Attribute Format** - completely replaces the old triple-based format. + +The new approach was chosen because: + +1. **No Information Loss**: Original URIs preserved correctly +2. **Simpler Logic**: No transformation needed, direct dict lookups work +3. **Namespace Safety**: Handles multiple ontologies without collisions +4. **Semantic Correctness**: Maintains RDF/OWL semantics + +## Implementation Complete + +### What Was Built: + +1. **New Prompt Template** (`prompts/ontology-extract-v2.txt`) + - ✅ Clear sections: Entity Types, Relationships, Attributes + - ✅ Example using full type identifiers (`fo/Recipe`, `fo/has_ingredient`) + - ✅ Instructions to use exact identifiers from schema + - ✅ New JSON format with entities/relationships/attributes arrays + +2. **Entity Normalization** (`entity_normalizer.py`) + - ✅ `normalize_entity_name()` - Converts names to URI-safe format + - ✅ `normalize_type_identifier()` - Handles slashes in types (`fo/Recipe` → `fo-recipe`) + - ✅ `build_entity_uri()` - Creates unique URIs using (name, type) tuple + - ✅ `EntityRegistry` - Tracks entities for deduplication + +3. 
**JSON Parser** (`simplified_parser.py`) + - ✅ Parses new format: `{entities: [...], relationships: [...], attributes: [...]}` + - ✅ Supports kebab-case and snake_case field names + - ✅ Returns structured dataclasses + - ✅ Graceful error handling with logging + +4. **Triple Converter** (`triple_converter.py`) + - ✅ `convert_entity()` - Generates type + label triples automatically + - ✅ `convert_relationship()` - Connects entity URIs via properties + - ✅ `convert_attribute()` - Adds literal values + - ✅ Looks up full URIs from ontology definitions + +5. **Updated Main Processor** (`extract.py`) + - ✅ Removed old triple-based extraction code + - ✅ Added `extract_with_simplified_format()` method + - ✅ Now exclusively uses new simplified format + - ✅ Calls prompt with `extract-with-ontologies-v2` ID + +## Test Cases + +### Test 1: URI Preservation +```python +# Given ontology class +classes = {"fo/Recipe": {"uri": "http://purl.org/ontology/fo/Recipe", ...}} + +# When LLM returns +llm_output = {"subject": "x", "predicate": "rdf:type", "object": "fo/Recipe"} + +# Then expanded URI should be +assert expanded == "http://purl.org/ontology/fo/Recipe" +# Not: "https://trustgraph.ai/ontology/food#Recipe" +``` + +### Test 2: Multi-Ontology Collision +```python +# Given two ontologies +ont1 = {"fo/Recipe": {...}} +ont2 = {"cooking/Recipe": {...}} + +# LLM should use full prefix to disambiguate +llm_output = {"object": "fo/Recipe"} # Not just "Recipe" +``` + +### Test 3: Entity Instance Format +```python +# Given prompt with food ontology +# LLM should create instances like +{"subject": "recipe:cornish-pasty"} # Namespace-style +{"subject": "food:beef"} # Consistent prefix +``` + +## Open Questions + +1. **Should entity instances use namespace prefixes?** + - Current: `"recipe:cornish-pasty"` (arbitrary) + - Alternative: Use ontology prefix `"fo:cornish-pasty"`? + - Alternative: No prefix, expand in URI `"cornish-pasty"` → full URI? + +2. 
**How to handle domain/range in prompt?** + - Currently shows: `(Recipe → Food)` + - Should it be: `(fo/Recipe → fo/Food)`? + +3. **Should we validate domain/range constraints?** + - TODO comment at extract.py:470 + - Would catch more errors but more complex + +4. **What about inverse properties and equivalences?** + - Ontology has `owl:inverseOf`, `owl:equivalentClass` + - Not currently used in extraction + - Should they be? + +## Success Metrics + +- ✅ Zero URI information loss (100% preservation of original URIs) +- ✅ LLM output format matches input format +- ✅ No ambiguous examples in prompt +- ✅ Tests pass with multiple ontologies +- ✅ Improved extraction quality (measured by valid triple %) + +## Alternative Approach: Simplified Extraction Format + +### Philosophy + +Instead of asking the LLM to understand RDF/OWL semantics, ask it to do what it's good at: **find entities and relationships in text**. + +Let the code handle URI construction, RDF conversion, and semantic web formalities. + +### Example: Entity Classification + +**Input Text:** +``` +Cornish pasty is a traditional British pastry filled with meat and vegetables. +``` + +**Ontology Schema (shown to LLM):** +```markdown +## Entity Types: +- Recipe: A recipe is a combination of ingredients and a method +- Food: A food is something that can be eaten +- Ingredient: An ingredient combines a quantity and a food +``` + +**What LLM Returns (Simple JSON):** +```json +{ + "entities": [ + { + "entity": "Cornish pasty", + "type": "Recipe" + } + ] +} +``` + +**What Code Produces (RDF Triples):** +```python +# 1. Normalize entity name + type to ID (type prevents collisions) +entity_id = "recipe-cornish-pasty" # normalize("Cornish pasty", "Recipe") +entity_uri = "https://trustgraph.ai/food/recipe-cornish-pasty" + +# Note: Same name, different type = different URI +# "Cornish pasty" (Recipe) → recipe-cornish-pasty +# "Cornish pasty" (Food) → food-cornish-pasty + +# 2. 
Generate triples +triples = [ + # Type triple + Triple( + s=Value(value=entity_uri, is_uri=True), + p=Value(value="http://www.w3.org/1999/02/22-rdf-syntax-ns#type", is_uri=True), + o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True) + ), + # Label triple (automatic) + Triple( + s=Value(value=entity_uri, is_uri=True), + p=Value(value="http://www.w3.org/2000/01/rdf-schema#label", is_uri=True), + o=Value(value="Cornish pasty", is_uri=False) + ) +] +``` + +### Benefits + +1. **LLM doesn't need to:** + - Understand URI syntax + - Invent identifier prefixes (`recipe:`, `ingredient:`) + - Know about `rdf:type` or `rdfs:label` + - Construct semantic web identifiers + +2. **LLM just needs to:** + - Find entities in text + - Map them to ontology classes + - Extract relationships and attributes + +3. **Code handles:** + - URI normalization and construction + - RDF triple generation + - Automatic label assignment + - Namespace management + +### Why This Works Better + +- **Simpler prompt** = less confusion = fewer errors +- **Consistent IDs** = code controls normalization rules +- **Auto-generated labels** = no missing rdfs:label triples +- **LLM focuses on extraction** = what it's actually good at + +### Example: Entity Relationships + +**Input Text:** +``` +Cornish pasty is a traditional British pastry filled with beef and potatoes. 
+``` + +**Ontology Schema (shown to LLM):** +```markdown +## Entity Types: +- Recipe: A recipe is a combination of ingredients and a method +- Food: A food is something that can be eaten +- Ingredient: An ingredient combines a quantity and a food + +## Relationships: +- has_ingredient: Relates a recipe to an ingredient it uses (Recipe → Ingredient) +- food: Relates an ingredient to the food that is required (Ingredient → Food) +``` + +**What LLM Returns (Simple JSON):** +```json +{ + "entities": [ + { + "entity": "Cornish pasty", + "type": "Recipe" + }, + { + "entity": "beef", + "type": "Food" + }, + { + "entity": "potatoes", + "type": "Food" + } + ], + "relationships": [ + { + "subject": "Cornish pasty", + "subject-type": "Recipe", + "relation": "has_ingredient", + "object": "beef", + "object-type": "Food" + }, + { + "subject": "Cornish pasty", + "subject-type": "Recipe", + "relation": "has_ingredient", + "object": "potatoes", + "object-type": "Food" + } + ] +} +``` + +**What Code Produces (RDF Triples):** +```python +# Normalize entity names to URIs +cornish_pasty_uri = "https://trustgraph.ai/food/cornish-pasty" +beef_uri = "https://trustgraph.ai/food/beef" +potatoes_uri = "https://trustgraph.ai/food/potatoes" + +# Look up relation URI from ontology +has_ingredient_uri = "http://purl.org/ontology/fo/ingredients" # from fo/has_ingredient + +triples = [ + # Entity type triples (as before) + Triple(s=cornish_pasty_uri, p=rdf_type, o="http://purl.org/ontology/fo/Recipe"), + Triple(s=cornish_pasty_uri, p=rdfs_label, o="Cornish pasty"), + + Triple(s=beef_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"), + Triple(s=beef_uri, p=rdfs_label, o="beef"), + + Triple(s=potatoes_uri, p=rdf_type, o="http://purl.org/ontology/fo/Food"), + Triple(s=potatoes_uri, p=rdfs_label, o="potatoes"), + + # Relationship triples + Triple( + s=Value(value=cornish_pasty_uri, is_uri=True), + p=Value(value=has_ingredient_uri, is_uri=True), + o=Value(value=beef_uri, is_uri=True) + ), + 
Triple( + s=Value(value=cornish_pasty_uri, is_uri=True), + p=Value(value=has_ingredient_uri, is_uri=True), + o=Value(value=potatoes_uri, is_uri=True) + ) +] +``` + +**Key Points:** +- LLM returns natural language entity names: `"Cornish pasty"`, `"beef"`, `"potatoes"` +- LLM includes types to disambiguate: `subject-type`, `object-type` +- LLM uses relation name from schema: `"has_ingredient"` +- Code derives consistent IDs using (name, type): `("Cornish pasty", "Recipe")` → `recipe-cornish-pasty` +- Code looks up relation URI from ontology: `fo/has_ingredient` → full URI +- Same (name, type) tuple always gets same URI (deduplication) + +### Example: Entity Name Disambiguation + +**Problem:** Same name can refer to different entity types. + +**Real-world case:** +``` +"Cornish pasty" can be: +- A Recipe (instructions for making it) +- A Food (the dish itself) +``` + +**How It's Handled:** + +LLM returns both as separate entities: +```json +{ + "entities": [ + {"entity": "Cornish pasty", "type": "Recipe"}, + {"entity": "Cornish pasty", "type": "Food"} + ], + "relationships": [ + { + "subject": "Cornish pasty", + "subject-type": "Recipe", + "relation": "produces", + "object": "Cornish pasty", + "object-type": "Food" + } + ] +} +``` + +**Code Resolution:** +```python +# Different types → different URIs +recipe_uri = normalize("Cornish pasty", "Recipe") +# → "https://trustgraph.ai/food/recipe-cornish-pasty" + +food_uri = normalize("Cornish pasty", "Food") +# → "https://trustgraph.ai/food/food-cornish-pasty" + +# Relationship connects them correctly +triple = Triple( + s=recipe_uri, # The Recipe + p="http://purl.org/ontology/fo/produces", + o=food_uri # The Food +) +``` + +**Why This Works:** +- Type is included in ALL references (entities, relationships, attributes) +- Code uses `(name, type)` tuple as lookup key +- No ambiguity, no collisions + +### Example: Entity Attributes + +**Input Text:** +``` +This Cornish pasty recipe serves 4-6 people and takes 45 minutes to 
prepare. +``` + +**Ontology Schema (shown to LLM):** +```markdown +## Entity Types: +- Recipe: A recipe is a combination of ingredients and a method + +## Attributes: +- serves: Indicates what the recipe is intended to serve (Recipe → text) +- preparation_time: Time needed to prepare the recipe (Recipe → text) +``` + +**What LLM Returns (Simple JSON):** +```json +{ + "entities": [ + { + "entity": "Cornish pasty recipe", + "type": "Recipe" + } + ], + "attributes": [ + { + "entity": "Cornish pasty recipe", + "entity-type": "Recipe", + "attribute": "serves", + "value": "4-6 people" + }, + { + "entity": "Cornish pasty recipe", + "entity-type": "Recipe", + "attribute": "preparation_time", + "value": "45 minutes" + } + ] +} +``` + +**What Code Produces (RDF Triples):** +```python +# Normalize entity name to URI +recipe_uri = "https://trustgraph.ai/food/cornish-pasty-recipe" + +# Look up attribute URIs from ontology +serves_uri = "http://purl.org/ontology/fo/serves" # from fo/serves +prep_time_uri = "http://purl.org/ontology/fo/preparation_time" # from fo/preparation_time + +triples = [ + # Entity type triple + Triple( + s=Value(value=recipe_uri, is_uri=True), + p=Value(value=rdf_type, is_uri=True), + o=Value(value="http://purl.org/ontology/fo/Recipe", is_uri=True) + ), + + # Label triple (automatic) + Triple( + s=Value(value=recipe_uri, is_uri=True), + p=Value(value=rdfs_label, is_uri=True), + o=Value(value="Cornish pasty recipe", is_uri=False) + ), + + # Attribute triples (objects are literals, not URIs) + Triple( + s=Value(value=recipe_uri, is_uri=True), + p=Value(value=serves_uri, is_uri=True), + o=Value(value="4-6 people", is_uri=False) # Literal value! + ), + Triple( + s=Value(value=recipe_uri, is_uri=True), + p=Value(value=prep_time_uri, is_uri=True), + o=Value(value="45 minutes", is_uri=False) # Literal value! 
+    )
+]
+```
+
+**Key Points:**
+- LLM extracts literal values: `"4-6 people"`, `"45 minutes"`
+- LLM includes entity type for disambiguation: `entity-type`
+- LLM uses attribute name from schema: `"serves"`, `"preparation_time"`
+- Code looks up attribute URI from ontology datatype properties
+- **Object is literal** (`is_uri=False`), not a URI reference
+- Values stay as natural text, no normalization needed
+
+**Difference from Relationships:**
+- Relationships: both subject and object are entities (URIs)
+- Attributes: subject is entity (URI), object is literal value (string/number)
+
+### Complete Example: Entities + Relationships + Attributes
+
+**Input Text:**
+```
+Cornish pasty is a savory pastry filled with beef and potatoes.
+This recipe serves 4 people.
+```
+
+**What LLM Returns:**
+```json
+{
+  "entities": [
+    {
+      "entity": "Cornish pasty",
+      "type": "Recipe"
+    },
+    {
+      "entity": "beef",
+      "type": "Food"
+    },
+    {
+      "entity": "potatoes",
+      "type": "Food"
+    }
+  ],
+  "relationships": [
+    {
+      "subject": "Cornish pasty",
+      "subject-type": "Recipe",
+      "relation": "has_ingredient",
+      "object": "beef",
+      "object-type": "Food"
+    },
+    {
+      "subject": "Cornish pasty",
+      "subject-type": "Recipe",
+      "relation": "has_ingredient",
+      "object": "potatoes",
+      "object-type": "Food"
+    }
+  ],
+  "attributes": [
+    {
+      "entity": "Cornish pasty",
+      "entity-type": "Recipe",
+      "attribute": "serves",
+      "value": "4 people"
+    }
+  ]
+}
+```
+
+**Result:** 9 RDF triples generated:
+- 3 entity type triples (rdf:type)
+- 3 entity label triples (rdfs:label) - automatic
+- 2 relationship triples (has_ingredient)
+- 1 attribute triple (serves)
+
+All from simple, natural language extractions by the LLM!
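The conversion pipeline described above (normalize each `(name, type)` pair into a URI, look up property URIs from the ontology, then emit type, label, relationship, and attribute triples) can be sketched end to end. This is a hedged illustration, not the actual `entity_normalizer.py`/`triple_converter.py` code: `normalize`, `entity_uri`, and `convert` are stand-in helpers, and the three lookup dicts stand in for the lookups the real converter performs against the loaded ontology (the URIs themselves are taken from the examples in this document).

```python
import re

# Hypothetical stand-ins mirroring the behavior described above;
# not the actual TrustGraph API.

def normalize(text: str) -> str:
    """Lowercase, convert separators to hyphens, drop unsafe characters."""
    t = re.sub(r"[/:\s_]+", "-", text.lower())
    t = re.sub(r"[^a-z0-9\-]", "", t)
    return re.sub(r"-+", "-", t).strip("-")

def entity_uri(name: str, etype: str, ontology_id: str,
               base: str = "https://trustgraph.ai") -> str:
    """(name, type) -> URI; including the type prevents name collisions."""
    return f"{base}/{ontology_id}/{normalize(etype)}-{normalize(name)}"

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def convert(result, type_uris, relation_uris, attribute_uris, ontology_id="food"):
    """Turn the simplified JSON format into (subject, predicate, object) triples."""
    triples = []
    for e in result.get("entities", []):
        uri = entity_uri(e["entity"], e["type"], ontology_id)
        triples.append((uri, RDF_TYPE, type_uris[e["type"]]))   # type triple
        triples.append((uri, RDFS_LABEL, e["entity"]))          # automatic label
    for r in result.get("relationships", []):
        s = entity_uri(r["subject"], r["subject-type"], ontology_id)
        o = entity_uri(r["object"], r["object-type"], ontology_id)
        triples.append((s, relation_uris[r["relation"]], o))
    for a in result.get("attributes", []):
        s = entity_uri(a["entity"], a["entity-type"], ontology_id)
        # Attribute objects are literal values, not URIs
        triples.append((s, attribute_uris[a["attribute"]], a["value"]))
    return triples

# Lookup tables the real converter would derive from the loaded ontology:
TYPE_URIS = {"Recipe": "http://purl.org/ontology/fo/Recipe",
             "Food": "http://purl.org/ontology/fo/Food"}
RELATION_URIS = {"has_ingredient": "http://purl.org/ontology/fo/ingredients"}
ATTRIBUTE_URIS = {"serves": "http://purl.org/ontology/fo/serves"}

example = {
    "entities": [
        {"entity": "Cornish pasty", "type": "Recipe"},
        {"entity": "beef", "type": "Food"},
        {"entity": "potatoes", "type": "Food"},
    ],
    "relationships": [
        {"subject": "Cornish pasty", "subject-type": "Recipe",
         "relation": "has_ingredient", "object": "beef", "object-type": "Food"},
        {"subject": "Cornish pasty", "subject-type": "Recipe",
         "relation": "has_ingredient", "object": "potatoes", "object-type": "Food"},
    ],
    "attributes": [
        {"entity": "Cornish pasty", "entity-type": "Recipe",
         "attribute": "serves", "value": "4 people"},
    ],
}

triples = convert(example, TYPE_URIS, RELATION_URIS, ATTRIBUTE_URIS)
# 3 type + 3 label + 2 relationship + 1 attribute triples
```

Running `convert` on the example yields the type, label, relationship, and attribute triples in one pass, and calling `entity_uri` twice with the same `(name, type)` pair always reproduces the same URI, which is what makes deduplication work.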
+ +## References + +- Current implementation: `trustgraph-flow/trustgraph/extract/kg/ontology/extract.py` +- Prompt template: `ontology-prompt.md` +- Test cases: `tests/unit/test_extract/test_ontology/` +- Example ontology: `e2e/test-data/food.ontology` diff --git a/ontology-prompt.md b/ontology-prompt.md new file mode 100644 index 00000000..6be255b7 --- /dev/null +++ b/ontology-prompt.md @@ -0,0 +1,54 @@ +You are a knowledge extraction expert. Extract structured triples from text using ONLY the provided ontology elements. + +## Ontology Classes: + +{% for class_id, class_def in classes.items() %} +- **{{class_id}}**{% if class_def.subclass_of %} (subclass of {{class_def.subclass_of}}){% endif %}{% if class_def.comment %}: {{class_def.comment}}{% endif %} +{% endfor %} + +## Object Properties (connect entities): + +{% for prop_id, prop_def in object_properties.items() %} +- **{{prop_id}}**{% if prop_def.domain and prop_def.range %} ({{prop_def.domain}} → {{prop_def.range}}){% endif %}{% if prop_def.comment %}: {{prop_def.comment}}{% endif %} +{% endfor %} + +## Datatype Properties (entity attributes): + +{% for prop_id, prop_def in datatype_properties.items() %} +- **{{prop_id}}**{% if prop_def.domain and prop_def.range %} ({{prop_def.domain}} → {{prop_def.range}}){% endif %}{% if prop_def.comment %}: {{prop_def.comment}}{% endif %} +{% endfor %} + +## Text to Analyze: + +{{text}} + +## Extraction Rules: + +1. Only use classes defined above for entity types +2. Only use properties defined above for relationships and attributes +3. Respect domain and range constraints where specified +4. For class instances, use `rdf:type` as the predicate +5. Include `rdfs:label` for new entities to provide human-readable names +6. Extract all relevant triples that can be inferred from the text +7. 
Use entity URIs or meaningful identifiers as subjects/objects + +## Output Format: + +Return ONLY a valid JSON array (no markdown, no code blocks) containing objects with these fields: +- "subject": the subject entity (URI or identifier) +- "predicate": the property (from ontology or rdf:type/rdfs:label) +- "object": the object entity or literal value + +Important: Return raw JSON only, with no markdown formatting, no code blocks, and no backticks. + +## Example Output: + +[ + {"subject": "recipe:cornish-pasty", "predicate": "rdf:type", "object": "Recipe"}, + {"subject": "recipe:cornish-pasty", "predicate": "rdfs:label", "object": "Cornish Pasty"}, + {"subject": "recipe:cornish-pasty", "predicate": "has_ingredient", "object": "ingredient:flour"}, + {"subject": "ingredient:flour", "predicate": "rdf:type", "object": "Ingredient"}, + {"subject": "ingredient:flour", "predicate": "rdfs:label", "object": "plain flour"} +] + +Now extract triples from the text above. diff --git a/trustgraph-flow/trustgraph/extract/kg/ontology/entity_normalizer.py b/trustgraph-flow/trustgraph/extract/kg/ontology/entity_normalizer.py new file mode 100644 index 00000000..712aadbe --- /dev/null +++ b/trustgraph-flow/trustgraph/extract/kg/ontology/entity_normalizer.py @@ -0,0 +1,164 @@ +""" +Entity URI normalization for ontology-based knowledge extraction. + +Converts entity names and types into consistent, collision-free URIs. +""" + +import re +from typing import Tuple + + +def normalize_entity_name(entity_name: str) -> str: + """Normalize entity name to URI-safe identifier. 
+ + Args: + entity_name: Natural language entity name (e.g., "Cornish pasty") + + Returns: + Normalized identifier (e.g., "cornish-pasty") + """ + # Convert to lowercase + normalized = entity_name.lower() + + # Replace spaces and underscores with hyphens + normalized = re.sub(r'[\s_]+', '-', normalized) + + # Remove any characters that aren't alphanumeric, hyphens, or periods + normalized = re.sub(r'[^a-z0-9\-.]', '', normalized) + + # Remove leading/trailing hyphens + normalized = normalized.strip('-') + + # Collapse multiple hyphens + normalized = re.sub(r'-+', '-', normalized) + + return normalized + + +def normalize_type_identifier(type_id: str) -> str: + """Normalize ontology type identifier to URI-safe format. + + Handles prefixed types like "fo/Recipe" by converting to "fo-recipe". + + Args: + type_id: Ontology type identifier (e.g., "fo/Recipe", "Food") + + Returns: + Normalized type identifier (e.g., "fo-recipe", "food") + """ + # Convert to lowercase + normalized = type_id.lower() + + # Replace slashes, colons, and spaces with hyphens + normalized = re.sub(r'[/:.\s_]+', '-', normalized) + + # Remove any remaining non-alphanumeric characters except hyphens + normalized = re.sub(r'[^a-z0-9\-]', '', normalized) + + # Remove leading/trailing hyphens + normalized = normalized.strip('-') + + # Collapse multiple hyphens + normalized = re.sub(r'-+', '-', normalized) + + return normalized + + +def build_entity_uri(entity_name: str, entity_type: str, ontology_id: str, + base_uri: str = "https://trustgraph.ai") -> str: + """Build a unique URI for an entity based on its name and type. + + The type is included in the URI to prevent collisions when the same + name refers to different entity types (e.g., "Cornish pasty" as both + Recipe and Food). 
+ + Args: + entity_name: Natural language entity name (e.g., "Cornish pasty") + entity_type: Ontology type (e.g., "fo/Recipe") + ontology_id: Ontology identifier (e.g., "food") + base_uri: Base URI for entity URIs (default: "https://trustgraph.ai") + + Returns: + Full entity URI (e.g., "https://trustgraph.ai/food/fo-recipe-cornish-pasty") + + Examples: + >>> build_entity_uri("Cornish pasty", "fo/Recipe", "food") + 'https://trustgraph.ai/food/fo-recipe-cornish-pasty' + + >>> build_entity_uri("Cornish pasty", "fo/Food", "food") + 'https://trustgraph.ai/food/fo-food-cornish-pasty' + + >>> build_entity_uri("beef", "fo/Food", "food") + 'https://trustgraph.ai/food/fo-food-beef' + """ + type_part = normalize_type_identifier(entity_type) + name_part = normalize_entity_name(entity_name) + + # Combine type and name to ensure uniqueness + entity_id = f"{type_part}-{name_part}" + + # Build full URI + return f"{base_uri}/{ontology_id}/{entity_id}" + + +class EntityRegistry: + """Registry to track entity name/type tuples and their assigned URIs. + + Ensures that the same (entity_name, entity_type) tuple always maps + to the same URI, enabling deduplication across the extraction process. + """ + + def __init__(self, ontology_id: str, base_uri: str = "https://trustgraph.ai"): + """Initialize the entity registry. + + Args: + ontology_id: Ontology identifier (e.g., "food") + base_uri: Base URI for entity URIs + """ + self.ontology_id = ontology_id + self.base_uri = base_uri + self._registry = {} # (entity_name, entity_type) -> uri + + def get_or_create_uri(self, entity_name: str, entity_type: str) -> str: + """Get existing URI or create new one for entity. 
+
+        Args:
+            entity_name: Natural language entity name
+            entity_type: Ontology type identifier
+
+        Returns:
+            URI for this entity (same URI for same name/type tuple)
+        """
+        key = (entity_name, entity_type)
+
+        if key not in self._registry:
+            uri = build_entity_uri(
+                entity_name,
+                entity_type,
+                self.ontology_id,
+                self.base_uri
+            )
+            self._registry[key] = uri
+
+        return self._registry[key]
+
+    def lookup(self, entity_name: str, entity_type: str) -> Optional[str]:
+        """Look up URI for entity (returns None if not registered).
+
+        Args:
+            entity_name: Natural language entity name
+            entity_type: Ontology type identifier
+
+        Returns:
+            URI for this entity, or None if not found
+        """
+        key = (entity_name, entity_type)
+        return self._registry.get(key)
+
+    def clear(self):
+        """Clear all registered entities."""
+        self._registry.clear()
+
+    def size(self) -> int:
+        """Get number of registered entities."""
+        return len(self._registry)
diff --git a/trustgraph-flow/trustgraph/extract/kg/ontology/extract.py b/trustgraph-flow/trustgraph/extract/kg/ontology/extract.py
index 12832eaf..335f07d2 100644
--- a/trustgraph-flow/trustgraph/extract/kg/ontology/extract.py
+++ b/trustgraph-flow/trustgraph/extract/kg/ontology/extract.py
@@ -20,6 +20,8 @@ from .ontology_embedder import OntologyEmbedder
 from .vector_store import InMemoryVectorStore
 from .text_processor import TextProcessor
 from .ontology_selector import OntologySelector, OntologySubset
+from .simplified_parser import parse_extraction_response
+from .triple_converter import TripleConverter
 
 logger = logging.getLogger(__name__)
 
@@ -298,25 +300,10 @@ class Processor(FlowProcessor):
         # Build extraction prompt variables
         prompt_variables = self.build_extraction_variables(chunk, ontology_subset)
 
-        # Call prompt service for extraction
-        try:
-            # Use prompt() method with extract-with-ontologies prompt ID
-            triples_response = await flow("prompt-request").prompt(
-                id="extract-with-ontologies",
-                variables=prompt_variables
-            )
-            logger.debug(f"Extraction response: {triples_response}")
-
-            if not isinstance(triples_response, list):
-                logger.error("Expected list of triples from prompt service")
-                triples_response = []
-
-        except Exception as e:
-            logger.error(f"Prompt service error: {e}", exc_info=True)
-            triples_response = []
-
-        # Parse and validate triples
-        triples = self.parse_and_validate_triples(triples_response, ontology_subset)
+        # Extract using simplified entity-relationship-attribute format
+        triples = await self.extract_with_simplified_format(
+            flow, chunk, ontology_subset, prompt_variables
+        )
 
         # Add metadata triples
         for t in v.metadata.metadata:
@@ -362,6 +349,55 @@ class Processor(FlowProcessor):
             []
         )
 
+    async def extract_with_simplified_format(
+        self,
+        flow,
+        chunk: str,
+        ontology_subset: OntologySubset,
+        prompt_variables: Dict[str, Any]
+    ) -> List[Triple]:
+        """Extract triples using simplified entity-relationship-attribute format.
+
+        Args:
+            flow: Flow object for accessing services
+            chunk: Text chunk to extract from
+            ontology_subset: Selected ontology subset
+            prompt_variables: Variables for prompt template
+
+        Returns:
+            List of Triple objects
+        """
+        try:
+            # Call prompt service with simplified format prompt
+            extraction_response = await flow("prompt-request").prompt(
+                id="extract-with-ontologies",
+                variables=prompt_variables
+            )
+            logger.debug(f"Simplified extraction response: {extraction_response}")
+
+            # Parse response into structured format
+            extraction_result = parse_extraction_response(extraction_response)
+
+            if not extraction_result:
+                logger.warning("Failed to parse extraction response")
+                return []
+
+            logger.info(f"Parsed {len(extraction_result.entities)} entities, "
+                        f"{len(extraction_result.relationships)} relationships, "
+                        f"{len(extraction_result.attributes)} attributes")
+
+            # Convert to RDF triples
+            converter = TripleConverter(ontology_subset, ontology_subset.ontology_id)
+            triples = converter.convert_all(extraction_result)
+
+            logger.info(f"Generated {len(triples)} RDF triples from simplified extraction")
+
+            return triples
+
+        except Exception as e:
+            logger.error(f"Simplified extraction error: {e}", exc_info=True)
+            return []
+
     def build_extraction_variables(self, chunk: str, ontology_subset: OntologySubset) -> Dict[str, Any]:
         """Build variables for ontology-based extraction prompt template.
 
diff --git a/trustgraph-flow/trustgraph/extract/kg/ontology/simplified_parser.py b/trustgraph-flow/trustgraph/extract/kg/ontology/simplified_parser.py
new file mode 100644
index 00000000..3131d977
--- /dev/null
+++ b/trustgraph-flow/trustgraph/extract/kg/ontology/simplified_parser.py
@@ -0,0 +1,234 @@
+"""
+Parser for simplified ontology extraction JSON format.
+
+Parses the new entity-relationship-attribute format from LLM responses.
+"""
+
+import json
+import logging
+from typing import List, Dict, Any, Optional
+from dataclasses import dataclass
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class Entity:
+    """Represents an extracted entity."""
+    entity: str
+    type: str
+
+
+@dataclass
+class Relationship:
+    """Represents an extracted relationship."""
+    subject: str
+    subject_type: str
+    relation: str
+    object: str
+    object_type: str
+
+
+@dataclass
+class Attribute:
+    """Represents an extracted attribute."""
+    entity: str
+    entity_type: str
+    attribute: str
+    value: str
+
+
+@dataclass
+class ExtractionResult:
+    """Complete extraction result."""
+    entities: List[Entity]
+    relationships: List[Relationship]
+    attributes: List[Attribute]
+
+
+def parse_extraction_response(response: Any) -> Optional[ExtractionResult]:
+    """Parse LLM extraction response into structured format.
+
+    Args:
+        response: LLM response (string JSON or already parsed dict)
+
+    Returns:
+        ExtractionResult with parsed entities/relationships/attributes,
+        or None if parsing fails
+    """
+    # Handle string response (parse JSON)
+    if isinstance(response, str):
+        try:
+            data = json.loads(response)
+        except json.JSONDecodeError as e:
+            logger.error(f"Failed to parse JSON response: {e}")
+            logger.debug(f"Response was: {response[:500]}")
+            return None
+    elif isinstance(response, dict):
+        data = response
+    else:
+        logger.error(f"Unexpected response type: {type(response)}")
+        return None
+
+    # Validate structure
+    if not isinstance(data, dict):
+        logger.error(f"Expected dict, got {type(data)}")
+        return None
+
+    # Parse entities
+    entities = []
+    entities_data = data.get('entities', [])
+    if not isinstance(entities_data, list):
+        logger.warning(f"'entities' is not a list: {type(entities_data)}")
+        entities_data = []
+
+    for entity_data in entities_data:
+        try:
+            entity = parse_entity(entity_data)
+            if entity:
+                entities.append(entity)
+        except Exception as e:
+            logger.warning(f"Failed to parse entity {entity_data}: {e}")
+
+    # Parse relationships
+    relationships = []
+    relationships_data = data.get('relationships', [])
+    if not isinstance(relationships_data, list):
+        logger.warning(f"'relationships' is not a list: {type(relationships_data)}")
+        relationships_data = []
+
+    for rel_data in relationships_data:
+        try:
+            relationship = parse_relationship(rel_data)
+            if relationship:
+                relationships.append(relationship)
+        except Exception as e:
+            logger.warning(f"Failed to parse relationship {rel_data}: {e}")
+
+    # Parse attributes
+    attributes = []
+    attributes_data = data.get('attributes', [])
+    if not isinstance(attributes_data, list):
+        logger.warning(f"'attributes' is not a list: {type(attributes_data)}")
+        attributes_data = []
+
+    for attr_data in attributes_data:
+        try:
+            attribute = parse_attribute(attr_data)
+            if attribute:
+                attributes.append(attribute)
+        except Exception as e:
+            logger.warning(f"Failed to parse attribute {attr_data}: {e}")
+
+    return ExtractionResult(
+        entities=entities,
+        relationships=relationships,
+        attributes=attributes
+    )
+
+
+def parse_entity(data: Dict[str, Any]) -> Optional[Entity]:
+    """Parse entity from dict.
+
+    The 'entity' and 'type' field names are single words, so no kebab/snake variants apply here.
+
+    Args:
+        data: Entity dict with 'entity' and 'type' fields
+
+    Returns:
+        Entity object or None if invalid
+    """
+    if not isinstance(data, dict):
+        logger.warning(f"Entity data is not a dict: {type(data)}")
+        return None
+
+    entity = data.get('entity')
+    entity_type = data.get('type')
+
+    if not entity or not entity_type:
+        logger.warning(f"Missing required fields in entity: {data}")
+        return None
+
+    if not isinstance(entity, str) or not isinstance(entity_type, str):
+        logger.warning(f"Entity fields must be strings: {data}")
+        return None
+
+    return Entity(entity=entity, type=entity_type)
+
+
+def parse_relationship(data: Dict[str, Any]) -> Optional[Relationship]:
+    """Parse relationship from dict.
+
+    Supports both kebab-case and snake_case field names for compatibility.
+
+    Args:
+        data: Relationship dict with subject, subject-type, relation, object, object-type
+
+    Returns:
+        Relationship object or None if invalid
+    """
+    if not isinstance(data, dict):
+        logger.warning(f"Relationship data is not a dict: {type(data)}")
+        return None
+
+    subject = data.get('subject')
+    subject_type = data.get('subject-type') or data.get('subject_type')
+    relation = data.get('relation')
+    obj = data.get('object')
+    object_type = data.get('object-type') or data.get('object_type')
+
+    if not all([subject, subject_type, relation, obj, object_type]):
+        logger.warning(f"Missing required fields in relationship: {data}")
+        return None
+
+    if not all(isinstance(v, str) for v in [subject, subject_type, relation, obj, object_type]):
+        logger.warning(f"Relationship fields must be strings: {data}")
+        return None
+
+    return Relationship(
+        subject=subject,
+        subject_type=subject_type,
+        relation=relation,
+        object=obj,
+        object_type=object_type
+    )
+
+
+def parse_attribute(data: Dict[str, Any]) -> Optional[Attribute]:
+    """Parse attribute from dict.
+
+    Supports both kebab-case and snake_case field names for compatibility.
+
+    Args:
+        data: Attribute dict with entity, entity-type, attribute, value
+
+    Returns:
+        Attribute object or None if invalid
+    """
+    if not isinstance(data, dict):
+        logger.warning(f"Attribute data is not a dict: {type(data)}")
+        return None
+
+    entity = data.get('entity')
+    entity_type = data.get('entity-type') or data.get('entity_type')
+    attribute = data.get('attribute')
+    value = data.get('value')
+
+    if not all([entity, entity_type, attribute, value is not None]):
+        logger.warning(f"Missing required fields in attribute: {data}")
+        return None
+
+    if not all(isinstance(v, str) for v in [entity, entity_type, attribute]):
+        logger.warning(f"Attribute fields must be strings: {data}")
+        return None
+
+    # Value can be string, number, bool - convert to string
+    if not isinstance(value, str):
+        value = str(value)
+
+    return Attribute(
+        entity=entity,
+        entity_type=entity_type,
+        attribute=attribute,
+        value=value
+    )
diff --git a/trustgraph-flow/trustgraph/extract/kg/ontology/triple_converter.py b/trustgraph-flow/trustgraph/extract/kg/ontology/triple_converter.py
new file mode 100644
index 00000000..2eb43b19
--- /dev/null
+++ b/trustgraph-flow/trustgraph/extract/kg/ontology/triple_converter.py
@@ -0,0 +1,228 @@
+"""
+Converts simplified extraction format to RDF triples.
+
+Transforms entities, relationships, and attributes into proper RDF triples
+with full URIs and correct is_uri flags.
+"""
+
+import logging
+from typing import List, Optional
+
+from .... schema import Triple, Value
+from .... rdf import RDF_TYPE, RDF_LABEL
+
+from .simplified_parser import Entity, Relationship, Attribute, ExtractionResult
+from .entity_normalizer import EntityRegistry
+from .ontology_selector import OntologySubset
+
+logger = logging.getLogger(__name__)
+
+
+class TripleConverter:
+    """Converts extraction results to RDF triples."""
+
+    def __init__(self, ontology_subset: OntologySubset, ontology_id: str):
+        """Initialize converter.
+
+        Args:
+            ontology_subset: Ontology subset with classes and properties
+            ontology_id: Ontology identifier for URI generation
+        """
+        self.ontology_subset = ontology_subset
+        self.ontology_id = ontology_id
+        self.entity_registry = EntityRegistry(ontology_id)
+
+    def convert_all(self, extraction: ExtractionResult) -> List[Triple]:
+        """Convert complete extraction result to RDF triples.
+
+        Args:
+            extraction: Parsed extraction with entities/relationships/attributes
+
+        Returns:
+            List of RDF Triple objects
+        """
+        triples = []
+
+        # Convert entities (generates type + label triples)
+        for entity in extraction.entities:
+            entity_triples = self.convert_entity(entity)
+            triples.extend(entity_triples)
+
+        # Convert relationships
+        for relationship in extraction.relationships:
+            rel_triple = self.convert_relationship(relationship)
+            if rel_triple:
+                triples.append(rel_triple)
+
+        # Convert attributes
+        for attribute in extraction.attributes:
+            attr_triple = self.convert_attribute(attribute)
+            if attr_triple:
+                triples.append(attr_triple)
+
+        return triples
+
+    def convert_entity(self, entity: Entity) -> List[Triple]:
+        """Convert entity to RDF triples (type + label).
+
+        Args:
+            entity: Entity object with name and type
+
+        Returns:
+            List containing type triple and label triple
+        """
+        triples = []
+
+        # Get or create URI for this entity
+        entity_uri = self.entity_registry.get_or_create_uri(
+            entity.entity,
+            entity.type
+        )
+
+        # Look up class URI from ontology
+        class_uri = self._get_class_uri(entity.type)
+        if not class_uri:
+            logger.warning(f"Unknown entity type '{entity.type}', skipping entity '{entity.entity}'")
+            return triples
+
+        # Generate type triple: entity rdf:type ClassURI
+        type_triple = Triple(
+            s=Value(value=entity_uri, is_uri=True),
+            p=Value(value=RDF_TYPE, is_uri=True),
+            o=Value(value=class_uri, is_uri=True)
+        )
+        triples.append(type_triple)
+
+        # Generate label triple: entity rdfs:label "entity name"
+        label_triple = Triple(
+            s=Value(value=entity_uri, is_uri=True),
+            p=Value(value=RDF_LABEL, is_uri=True),
+            o=Value(value=entity.entity, is_uri=False)  # Literal!
+        )
+        triples.append(label_triple)
+
+        return triples
+
+    def convert_relationship(self, relationship: Relationship) -> Optional[Triple]:
+        """Convert relationship to RDF triple.
+
+        Args:
+            relationship: Relationship with subject/object entities and relation
+
+        Returns:
+            Triple connecting two entity URIs via property URI, or None if invalid
+        """
+        # Get URIs for subject and object entities
+        subject_uri = self.entity_registry.get_or_create_uri(
+            relationship.subject,
+            relationship.subject_type
+        )
+
+        object_uri = self.entity_registry.get_or_create_uri(
+            relationship.object,
+            relationship.object_type
+        )
+
+        # Look up property URI from ontology
+        property_uri = self._get_object_property_uri(relationship.relation)
+        if not property_uri:
+            logger.warning(f"Unknown relationship '{relationship.relation}', skipping")
+            return None
+
+        # Generate triple: subject property object
+        return Triple(
+            s=Value(value=subject_uri, is_uri=True),
+            p=Value(value=property_uri, is_uri=True),
+            o=Value(value=object_uri, is_uri=True)
+        )
+
+    def convert_attribute(self, attribute: Attribute) -> Optional[Triple]:
+        """Convert attribute to RDF triple.
+
+        Args:
+            attribute: Attribute with entity, attribute name, and literal value
+
+        Returns:
+            Triple with entity URI, property URI, and literal value, or None if invalid
+        """
+        # Get URI for entity
+        entity_uri = self.entity_registry.get_or_create_uri(
+            attribute.entity,
+            attribute.entity_type
+        )
+
+        # Look up property URI from ontology
+        property_uri = self._get_datatype_property_uri(attribute.attribute)
+        if not property_uri:
+            logger.warning(f"Unknown attribute '{attribute.attribute}', skipping")
+            return None
+
+        # Generate triple: entity property "literal value"
+        return Triple(
+            s=Value(value=entity_uri, is_uri=True),
+            p=Value(value=property_uri, is_uri=True),
+            o=Value(value=attribute.value, is_uri=False)  # Literal!
+        )
+
+    def _get_class_uri(self, class_id: str) -> Optional[str]:
+        """Get full URI for ontology class.
+
+        Args:
+            class_id: Class identifier (e.g., "fo/Recipe")
+
+        Returns:
+            Full class URI or None if not found
+        """
+        if class_id not in self.ontology_subset.classes:
+            return None
+
+        class_def = self.ontology_subset.classes[class_id]
+
+        # Extract URI from class definition
+        if isinstance(class_def, dict) and 'uri' in class_def:
+            return class_def['uri']
+
+        # Fallback: construct URI
+        return f"https://trustgraph.ai/ontology/{self.ontology_id}#{class_id}"
+
+    def _get_object_property_uri(self, property_id: str) -> Optional[str]:
+        """Get full URI for object property.
+
+        Args:
+            property_id: Property identifier (e.g., "fo/has_ingredient")
+
+        Returns:
+            Full property URI or None if not found
+        """
+        if property_id not in self.ontology_subset.object_properties:
+            return None
+
+        prop_def = self.ontology_subset.object_properties[property_id]
+
+        # Extract URI from property definition
+        if isinstance(prop_def, dict) and 'uri' in prop_def:
+            return prop_def['uri']
+
+        # Fallback: construct URI
+        return f"https://trustgraph.ai/ontology/{self.ontology_id}#{property_id}"
+
+    def _get_datatype_property_uri(self, property_id: str) -> Optional[str]:
+        """Get full URI for datatype property.
+
+        Args:
+            property_id: Property identifier (e.g., "fo/serves")
+
+        Returns:
+            Full property URI or None if not found
+        """
+        if property_id not in self.ontology_subset.datatype_properties:
+            return None
+
+        prop_def = self.ontology_subset.datatype_properties[property_id]
+
+        # Extract URI from property definition
+        if isinstance(prop_def, dict) and 'uri' in prop_def:
+            return prop_def['uri']
+
+        # Fallback: construct URI
+        return f"https://trustgraph.ai/ontology/{self.ontology_id}#{property_id}"
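
For reference, a minimal standalone sketch of the simplified response shape the new parser consumes, and the two normalization rules it applies: kebab-case keys are preferred over snake_case, and non-string attribute values are coerced to strings before a literal `Value` is built. The recipe names and `fo/` identifiers are illustrative, borrowed from the `food.ontology` examples in the spec above, not a fixed schema.

```python
import json

# Illustrative LLM response in the simplified entity-relationship-attribute
# format that parse_extraction_response() expects.
response = json.dumps({
    "entities": [
        {"entity": "Cornish Pasty", "type": "fo/Recipe"},
        {"entity": "flour", "type": "fo/Food"}
    ],
    "relationships": [
        {"subject": "Cornish Pasty", "subject-type": "fo/Recipe",
         "relation": "fo/has_ingredient", "object": "flour",
         "object-type": "fo/Food"}
    ],
    "attributes": [
        {"entity": "Cornish Pasty", "entity-type": "fo/Recipe",
         "attribute": "fo/serves", "value": 4}
    ]
})

data = json.loads(response)

# Kebab-case keys win, with snake_case accepted as a fallback, mirroring
# parse_relationship() and parse_attribute().
rel = data["relationships"][0]
subject_type = rel.get("subject-type") or rel.get("subject_type")

# Non-string attribute values (numbers, booleans) are coerced to strings,
# as parse_attribute() does before constructing the Attribute dataclass.
attr = data["attributes"][0]
value = attr["value"] if isinstance(attr["value"], str) else str(attr["value"])

print(subject_type)  # fo/Recipe
print(value)         # 4
```

The string coercion matters because `convert_attribute()` emits the value as an RDF literal (`is_uri=False`); the schema `Value` carries a string, so numeric JSON values must be stringified before conversion.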