Subgraph provenance (#694)

Replace per-triple provenance reification with subgraph model

Extraction provenance previously created a full reification (statement
URI, activity, agent) for every single extracted triple, producing ~13
provenance triples per knowledge triple.  Since each chunk is processed
by a single LLM call, this was both redundant and semantically
inaccurate.

Now one subgraph object is created per chunk extraction, with
tg:contains linking to each extracted triple.  For 20 extractions from
a chunk this reduces provenance from ~260 triples to ~33.

- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to
  generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent
  (previously had none)
- Update CLI tools and tech specs to match

Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.

Added extra typing for extraction provenance, fixed extraction prov CLI
cybermaggedon 2026-03-13 11:37:59 +00:00 committed by GitHub
parent 35128ff019
commit 64e3f6bd0d
20 changed files with 463 additions and 193 deletions


@@ -0,0 +1,205 @@
# Extraction Provenance: Subgraph Model

## Problem

Extraction-time provenance currently generates a full reification per
extracted triple: a unique `stmt_uri`, `activity_uri`, and associated
PROV-O metadata for every single knowledge fact. Processing one chunk
that yields 20 relationships produces ~220 provenance triples on top of
the ~20 knowledge triples — a roughly 11:1 overhead.

This is both expensive (storage, indexing, transmission) and semantically
inaccurate. Each chunk is processed by a single LLM call that produces
all its triples in one transaction. The current per-triple model
obscures that by creating the illusion of 20 independent extraction
events.

Additionally, two of the four extraction processors (kg-extract-ontology,
kg-extract-agent) have no provenance at all, leaving gaps in the audit
trail.
## Solution

Replace per-triple reification with a **subgraph model**: one provenance
record per chunk extraction, shared across all triples produced from that
chunk.

### Terminology Change

| Old | New |
|-----|-----|
| `stmt_uri` (`https://trustgraph.ai/stmt/{uuid}`) | `subgraph_uri` (`https://trustgraph.ai/subgraph/{uuid}`) |
| `statement_uri()` | `subgraph_uri()` |
| `tg:reifies` (1:1, identity) | `tg:contains` (1:many, containment) |

### Target Structure

All provenance triples go in the `urn:graph:source` named graph.

```
# Subgraph contains each extracted triple (RDF-star quoted triples)
<subgraph> tg:contains <<s1 p1 o1>> .
<subgraph> tg:contains <<s2 p2 o2>> .
<subgraph> tg:contains <<s3 p3 o3>> .

# Derivation from source chunk
<subgraph> prov:wasDerivedFrom <chunk_uri> .
<subgraph> prov:wasGeneratedBy <activity> .

# Activity: one per chunk extraction
<activity> rdf:type prov:Activity .
<activity> rdfs:label "{component_name} extraction" .
<activity> prov:used <chunk_uri> .
<activity> prov:wasAssociatedWith <agent> .
<activity> prov:startedAtTime "2026-03-13T10:00:00Z" .
<activity> tg:componentVersion "0.25.0" .
<activity> tg:llmModel "gpt-4" .       # if available
<activity> tg:ontology <ontology_uri> . # if available

# Agent: stable per component
<agent> rdf:type prov:Agent .
<agent> rdfs:label "{component_name}" .
```

### Volume Comparison

For a chunk producing N extracted triples:

| | Old (per-triple) | New (subgraph) |
|---|---|---|
| `tg:contains` / `tg:reifies` | N | N |
| Activity triples | ~9 x N | ~9 |
| Agent triples | 2 x N | 2 |
| Statement/subgraph metadata | 2 x N | 2 |
| **Total provenance triples** | **~13N** | **N + 13** |
| **Example (N=20)** | **~260** | **33** |
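
The totals above can be sanity-checked with a little arithmetic. This sketch is not part of the change itself; it just encodes the document's approximate constants (~13 provenance triples per extracted triple under the old model, a fixed ~13-triple shared activity/agent/metadata block under the new one):

```python
def old_provenance_count(n: int) -> int:
    # Per-triple reification: every extracted triple repeats the full
    # activity/agent/statement block (~13 triples each).
    return 13 * n

def new_provenance_count(n: int) -> int:
    # Subgraph model: one tg:contains per triple, plus one shared block
    # per chunk extraction.
    return n + 13

for n in (1, 20, 100):
    print(n, old_provenance_count(n), new_provenance_count(n))
```

The savings grow linearly with chunk yield: at N=20 the new model emits 33 provenance triples instead of ~260.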
## Scope

### Processors to Update (existing provenance, per-triple)

**kg-extract-definitions**
(`trustgraph-flow/trustgraph/extract/kg/definitions/extract.py`)

Currently calls `statement_uri()` + `triple_provenance_triples()` inside
the per-definition loop.

Changes:
- Move `subgraph_uri()` and `activity_uri()` creation before the loop
- Collect `tg:contains` triples inside the loop
- Emit shared activity/agent/derivation block once after the loop

**kg-extract-relationships**
(`trustgraph-flow/trustgraph/extract/kg/relationships/extract.py`)

Same pattern as definitions. Same changes.
### Processors to Add Provenance (currently missing)

**kg-extract-ontology**
(`trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`)

Currently emits triples with no provenance. Add subgraph provenance
using the same pattern: one subgraph per chunk, `tg:contains` for each
extracted triple.

**kg-extract-agent**
(`trustgraph-flow/trustgraph/extract/kg/agent/extract.py`)

Currently emits triples with no provenance. Add subgraph provenance
using the same pattern.

### Shared Provenance Library Changes

**`trustgraph-base/trustgraph/provenance/triples.py`**
- Replace `triple_provenance_triples()` with `subgraph_provenance_triples()`
- New function accepts a list of extracted triples instead of a single one
- Generates one `tg:contains` per triple, shared activity/agent block
- Remove old `triple_provenance_triples()`

**`trustgraph-base/trustgraph/provenance/uris.py`**
- Replace `statement_uri()` with `subgraph_uri()`

**`trustgraph-base/trustgraph/provenance/namespaces.py`**
- Replace `TG_REIFIES` with `TG_CONTAINS`

### Not in Scope

- **kg-extract-topics**: older-style processor, not currently used in
  standard flows
- **kg-extract-rows**: produces rows, not triples; different provenance
  model
- **Query-time provenance** (`urn:graph:retrieval`): separate concern,
  already uses a different pattern (question/exploration/focus/synthesis)
- **Document/page/chunk provenance** (PDF decoder, chunker): already uses
  `derived_entity_triples()` which is per-entity, not per-triple — no
  redundancy issue
## Implementation Notes

### Processor Loop Restructure

Before (per-triple, in relationships):

```python
for rel in rels:
    # ... build relationship_triple ...
    stmt_uri = statement_uri()
    prov_triples = triple_provenance_triples(
        stmt_uri=stmt_uri,
        extracted_triple=relationship_triple,
        ...
    )
    triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```
After (subgraph):

```python
sg_uri = subgraph_uri()
extracted_triples = []
for rel in rels:
    # ... build relationship_triple ...
    extracted_triples.append(relationship_triple)
prov_triples = subgraph_provenance_triples(
    subgraph_uri=sg_uri,
    extracted_triples=extracted_triples,
    chunk_uri=chunk_uri,
    component_name=default_ident,
    component_version=COMPONENT_VERSION,
    llm_model=llm_model,
    ontology_uri=ontology_uri,
)
triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```
### New Helper Signature

```python
def subgraph_provenance_triples(
    subgraph_uri: str,
    extracted_triples: List[Triple],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    """
    Build provenance triples for a subgraph of extracted knowledge.

    Creates:
    - tg:contains link for each extracted triple (RDF-star quoted)
    - One prov:wasDerivedFrom link to source chunk
    - One activity with agent metadata
    """
```
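
The containment part of the helper can be sketched in isolation. This is a minimal, illustrative sketch only: it uses a stand-in `FakeTriple` dataclass and plain tuples in place of trustgraph's real `Triple`/`Term` types, and the names `FakeTriple` and `sketch_contains_triples` are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Tuple

TG_CONTAINS = "https://trustgraph.ai/ns/contains"
PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"

@dataclass(frozen=True)
class FakeTriple:
    # Stand-in for trustgraph's Triple type
    s: str
    p: str
    o: str

def sketch_contains_triples(
    subgraph_uri: str,
    extracted: List[FakeTriple],
    chunk_uri: str,
) -> List[Tuple[str, str, object]]:
    out: List[Tuple[str, str, object]] = []
    # One tg:contains per extracted triple; the object slot holds the
    # quoted triple itself (RDF-star style containment)
    for t in extracted:
        out.append((subgraph_uri, TG_CONTAINS, t))
    # Shared derivation link, emitted once per chunk rather than per triple
    out.append((subgraph_uri, PROV_WAS_DERIVED_FROM, chunk_uri))
    return out
```

For N extracted triples this yields N containment links plus one derivation link; the activity/agent block (omitted here) is likewise emitted once per chunk.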
### Breaking Change

This is a breaking change to the provenance model. Provenance has not
been released, so no migration is needed. The old `tg:reifies` /
`statement_uri` code can be removed outright.


@@ -311,10 +311,10 @@ activity:chunk-789 tg:chunkOverlap 200 .
 # The extracted triple (edge)
 entity:JohnSmith rel:worksAt entity:AcmeCorp .
-# Statement object pointing at the edge (RDF 1.2 reification)
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 prov:wasGeneratedBy activity:extract-999 .
+# Subgraph containing the extracted triples
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 prov:wasGeneratedBy activity:extract-999 .
 activity:extract-999 a prov:Activity .
 activity:extract-999 prov:used chunk:123-1-1 .
@@ -344,7 +344,7 @@ Custom predicates under the `tg:` namespace for extraction-specific metadata:
 | Predicate | Domain | Description |
 |-----------|--------|-------------|
-| `tg:reifies` | Statement | Points at the triple this statement object represents |
+| `tg:contains` | Subgraph | Points at a triple contained in this extraction subgraph |
 | `tg:pageCount` | Document | Total number of pages in source document |
 | `tg:mimeType` | Document | MIME type of source document |
 | `tg:pageNumber` | Page | Page number in source document |
@@ -383,7 +383,7 @@ prov:startedAtTime rdfs:label "started at" .
 **TrustGraph Predicates:**
 ```
-tg:reifies rdfs:label "reifies" .
+tg:contains rdfs:label "contains" .
 tg:pageCount rdfs:label "page count" .
 tg:mimeType rdfs:label "MIME type" .
 tg:pageNumber rdfs:label "page number" .
@@ -416,20 +416,20 @@ For finer-grained provenance, it would be valuable to record exactly where withi
 # The extracted triple
 entity:JohnSmith rel:worksAt entity:AcmeCorp .
-# Statement with sub-chunk provenance
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
-stmt:001 tg:sourceCharOffset 1547 .
-stmt:001 tg:sourceCharLength 46 .
+# Subgraph with sub-chunk provenance
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:sourceCharOffset 1547 .
+subgraph:001 tg:sourceCharLength 46 .
 ```
 **Example with text range (alternative):**
 ```
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceRange "1547-1593" .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceRange "1547-1593" .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
 ```
 **Implementation considerations:**


@@ -193,7 +193,7 @@ When storing explainability data, URIs from `uri_map` are used.
 Selected edges can be traced back to source documents:
-1. Query for reifying statement: `?stmt tg:reifies <<s p o>>`
+1. Query for containing subgraph: `?subgraph tg:contains <<s p o>>`
 2. Follow `prov:wasDerivedFrom` chain to root document
 3. Each step in chain: chunk → page → document
@@ -209,7 +209,7 @@ elif term.type == TRIPLE:
 This enables queries like:
 ```
-?stmt tg:reifies <<http://example.org/s http://example.org/p "value">>
+?subgraph tg:contains <<http://example.org/s http://example.org/p "value">>
 ```
 ## CLI Usage


@@ -128,7 +128,7 @@ class TestAgentKgExtractionIntegration:
         # Parse and process
         extraction_data = extractor.parse_jsonl(agent_response)
-        triples, entity_contexts = extractor.process_extraction_data(extraction_data, v.metadata)
+        triples, entity_contexts, extracted_triples = extractor.process_extraction_data(extraction_data, v.metadata)
         # Emit outputs
         if triples:


@@ -168,7 +168,7 @@ This is not JSON at all
             }
         ]
-        triples, entity_contexts = agent_extractor.process_extraction_data(data, sample_metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(data, sample_metadata)
         # Check entity label triple
         label_triple = next((t for t in triples if t.p.iri == RDF_LABEL and t.o.value == "Machine Learning"), None)
@@ -206,7 +206,7 @@ This is not JSON at all
             }
         ]
-        triples, entity_contexts = agent_extractor.process_extraction_data(data, sample_metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(data, sample_metadata)
         # Check that subject, predicate, and object labels are created
         subject_uri = f"{TRUSTGRAPH_ENTITIES}Machine%20Learning"
@@ -244,7 +244,7 @@ This is not JSON at all
             }
         ]
-        triples, entity_contexts = agent_extractor.process_extraction_data(data, sample_metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(data, sample_metadata)
         # Check that object labels are not created for literal objects
         object_labels = [t for t in triples if t.p.iri == RDF_LABEL and t.o.value == "95%"]
@@ -253,7 +253,7 @@ This is not JSON at all
     def test_process_extraction_data_combined(self, agent_extractor, sample_metadata, sample_extraction_data):
         """Test processing of combined definitions and relationships"""
-        triples, entity_contexts = agent_extractor.process_extraction_data(sample_extraction_data, sample_metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(sample_extraction_data, sample_metadata)
         # Check that we have both definition and relationship triples
         definition_triples = [t for t in triples if t.p.iri == DEFINITION]
@@ -272,7 +272,7 @@ This is not JSON at all
             {"type": "definition", "entity": "Test Entity", "definition": "Test definition"}
         ]
-        triples, entity_contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should not create subject-of relationships when no metadata ID
         subject_of_triples = [t for t in triples if t.p.iri == SUBJECT_OF]
@@ -285,7 +285,7 @@ This is not JSON at all
         """Test processing of empty extraction data"""
         data = []
-        triples, entity_contexts = agent_extractor.process_extraction_data(data, sample_metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(data, sample_metadata)
         # Should have no entity contexts
         assert len(entity_contexts) == 0
@@ -300,7 +300,7 @@ This is not JSON at all
             {"type": "relationship", "subject": "A", "predicate": "rel", "object": "B", "object-entity": True}
         ]
-        triples, entity_contexts = agent_extractor.process_extraction_data(data, sample_metadata)
+        triples, entity_contexts, _ = agent_extractor.process_extraction_data(data, sample_metadata)
         # Should process valid items and ignore unknown types
         assert len(entity_contexts) == 1  # Only the definition creates entity context


@@ -168,7 +168,7 @@ class TestAgentKgExtractionEdgeCases:
         """Test processing with empty or minimal metadata"""
         # Test with None metadata - may not raise AttributeError depending on implementation
         try:
-            triples, contexts = agent_extractor.process_extraction_data([], None)
+            triples, contexts, _ = agent_extractor.process_extraction_data([], None)
             # If it doesn't raise, check the results
             assert len(triples) == 0
             assert len(contexts) == 0
@@ -178,14 +178,14 @@ class TestAgentKgExtractionEdgeCases:
         # Test with metadata without ID
         metadata = Metadata(id=None)
-        triples, contexts = agent_extractor.process_extraction_data([], metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data([], metadata)
         assert len(triples) == 0
         assert len(contexts) == 0
         # Test with metadata with empty string ID
         metadata = Metadata(id="")
         data = [{"type": "definition", "entity": "Test", "definition": "Test def"}]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should not create subject-of triples when ID is empty string
         subject_of_triples = [t for t in triples if t.p.iri == SUBJECT_OF]
@@ -213,7 +213,7 @@ class TestAgentKgExtractionEdgeCases:
             for entity in special_entities
         ]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Verify all entities were processed
         assert len(contexts) == len(special_entities)
@@ -234,7 +234,7 @@ class TestAgentKgExtractionEdgeCases:
             {"type": "definition", "entity": "Test Entity", "definition": long_definition}
         ]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should handle long definitions without issues
         assert len(contexts) == 1
@@ -256,7 +256,7 @@ class TestAgentKgExtractionEdgeCases:
             {"type": "definition", "entity": "AI", "definition": "Another AI definition"},  # Duplicate
         ]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should process all entries (including duplicates)
         assert len(contexts) == 4
@@ -280,7 +280,7 @@ class TestAgentKgExtractionEdgeCases:
             {"type": "relationship", "subject": "test", "predicate": "test", "object": "", "object-entity": True},
         ]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should handle empty strings by creating URIs (even if empty)
         assert len(contexts) == 3
@@ -306,7 +306,7 @@ class TestAgentKgExtractionEdgeCases:
             }
         ]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should handle JSON strings in definitions without parsing them
         assert len(contexts) == 2
@@ -334,7 +334,7 @@ class TestAgentKgExtractionEdgeCases:
             {"type": "relationship", "subject": "A", "predicate": "rel7", "object": "F", "object-entity": 1},
         ]
-        triples, contexts = agent_extractor.process_extraction_data(data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(data, metadata)
         # Should process all relationships
         # Note: The current implementation has some logic issues that these tests document
@@ -416,7 +416,7 @@ class TestAgentKgExtractionEdgeCases:
         import time
         start_time = time.time()
-        triples, contexts = agent_extractor.process_extraction_data(large_data, metadata)
+        triples, contexts, _ = agent_extractor.process_extraction_data(large_data, metadata)
         end_time = time.time()
         processing_time = end_time - start_time


@@ -41,7 +41,7 @@ class QuotedTriple:
     enabling statements about statements.
     Example:
-        # stmt:123 tg:reifies <<:Hope skos:definition "A feeling...">>
+        # subgraph:123 tg:contains <<:Hope skos:definition "A feeling...">>
         qt = QuotedTriple(
             s=Uri("https://example.org/Hope"),
             p=Uri("http://www.w3.org/2004/02/skos/core#definition"),


@@ -2,7 +2,7 @@
 Provenance module for extraction-time provenance support.
 Provides helpers for:
-- URI generation for documents, pages, chunks, activities, statements
+- URI generation for documents, pages, chunks, activities, subgraphs
 - PROV-O triple building for provenance metadata
 - Vocabulary bootstrap for per-collection initialization
@@ -38,7 +38,7 @@ from . uris import (
     chunk_uri_from_page,
     chunk_uri_from_doc,
     activity_uri,
-    statement_uri,
+    subgraph_uri,
     agent_uri,
     # Query-time provenance URIs (GraphRAG)
     question_uri,
@@ -66,11 +66,13 @@ from . namespaces import (
     # RDF/RDFS
     RDF, RDF_TYPE, RDFS, RDFS_LABEL,
     # TrustGraph
-    TG, TG_REIFIES, TG_PAGE_COUNT, TG_MIME_TYPE, TG_PAGE_NUMBER,
+    TG, TG_CONTAINS, TG_PAGE_COUNT, TG_MIME_TYPE, TG_PAGE_NUMBER,
     TG_CHUNK_INDEX, TG_CHAR_OFFSET, TG_CHAR_LENGTH,
     TG_CHUNK_SIZE, TG_CHUNK_OVERLAP, TG_COMPONENT_VERSION,
     TG_LLM_MODEL, TG_ONTOLOGY, TG_EMBEDDING_MODEL,
     TG_SOURCE_TEXT, TG_SOURCE_CHAR_OFFSET, TG_SOURCE_CHAR_LENGTH,
+    # Extraction provenance entity types
+    TG_DOCUMENT_TYPE, TG_PAGE_TYPE, TG_CHUNK_TYPE, TG_SUBGRAPH_TYPE,
     # Query-time provenance predicates (GraphRAG)
     TG_QUERY, TG_EDGE_COUNT, TG_SELECTED_EDGE, TG_REASONING, TG_CONTENT,
     # Query-time provenance predicates (DocumentRAG)
@@ -94,7 +96,7 @@ from . namespaces import (
 from . triples import (
     document_triples,
     derived_entity_triples,
-    triple_provenance_triples,
+    subgraph_provenance_triples,
     # Query-time provenance triple builders (GraphRAG)
     question_triples,
     exploration_triples,
@@ -121,6 +123,7 @@ from . vocabulary import (
     PROV_CLASS_LABELS,
     PROV_PREDICATE_LABELS,
     DC_PREDICATE_LABELS,
+    TG_CLASS_LABELS,
     TG_PREDICATE_LABELS,
 )
@@ -132,7 +135,7 @@ __all__ = [
     "chunk_uri_from_page",
     "chunk_uri_from_doc",
     "activity_uri",
-    "statement_uri",
+    "subgraph_uri",
     "agent_uri",
     # Query-time provenance URIs
     "question_uri",
@@ -153,11 +156,13 @@ __all__ = [
     "PROV_USED", "PROV_WAS_ASSOCIATED_WITH", "PROV_STARTED_AT_TIME",
     "DC", "DC_TITLE", "DC_SOURCE", "DC_DATE", "DC_CREATOR",
     "RDF", "RDF_TYPE", "RDFS", "RDFS_LABEL",
-    "TG", "TG_REIFIES", "TG_PAGE_COUNT", "TG_MIME_TYPE", "TG_PAGE_NUMBER",
+    "TG", "TG_CONTAINS", "TG_PAGE_COUNT", "TG_MIME_TYPE", "TG_PAGE_NUMBER",
     "TG_CHUNK_INDEX", "TG_CHAR_OFFSET", "TG_CHAR_LENGTH",
     "TG_CHUNK_SIZE", "TG_CHUNK_OVERLAP", "TG_COMPONENT_VERSION",
     "TG_LLM_MODEL", "TG_ONTOLOGY", "TG_EMBEDDING_MODEL",
     "TG_SOURCE_TEXT", "TG_SOURCE_CHAR_OFFSET", "TG_SOURCE_CHAR_LENGTH",
+    # Extraction provenance entity types
+    "TG_DOCUMENT_TYPE", "TG_PAGE_TYPE", "TG_CHUNK_TYPE", "TG_SUBGRAPH_TYPE",
     # Query-time provenance predicates (GraphRAG)
     "TG_QUERY", "TG_EDGE_COUNT", "TG_SELECTED_EDGE", "TG_REASONING", "TG_CONTENT",
     # Query-time provenance predicates (DocumentRAG)
@@ -178,7 +183,7 @@ __all__ = [
     # Triple builders
     "document_triples",
     "derived_entity_triples",
-    "triple_provenance_triples",
+    "subgraph_provenance_triples",
     # Query-time provenance triple builders (GraphRAG)
     "question_triples",
     "exploration_triples",
@@ -199,5 +204,6 @@ __all__ = [
     "PROV_CLASS_LABELS",
     "PROV_PREDICATE_LABELS",
     "DC_PREDICATE_LABELS",
+    "TG_CLASS_LABELS",
     "TG_PREDICATE_LABELS",
 ]


@@ -42,7 +42,7 @@ SKOS_DEFINITION = SKOS + "definition"
 # TrustGraph namespace for custom predicates
 TG = "https://trustgraph.ai/ns/"
-TG_REIFIES = TG + "reifies"
+TG_CONTAINS = TG + "contains"
 TG_PAGE_COUNT = TG + "pageCount"
 TG_MIME_TYPE = TG + "mimeType"
 TG_PAGE_NUMBER = TG + "pageNumber"
@@ -72,6 +72,12 @@ TG_DOCUMENT = TG + "document"  # Reference to document in librarian
 TG_CHUNK_COUNT = TG + "chunkCount"
 TG_SELECTED_CHUNK = TG + "selectedChunk"
+# Extraction provenance entity types
+TG_DOCUMENT_TYPE = TG + "Document"
+TG_PAGE_TYPE = TG + "Page"
+TG_CHUNK_TYPE = TG + "Chunk"
+TG_SUBGRAPH_TYPE = TG + "Subgraph"
+
 # Explainability entity types (shared)
 TG_QUESTION = TG + "Question"
 TG_EXPLORATION = TG + "Exploration"


@ -16,7 +16,9 @@ from . namespaces import (
TG_PAGE_COUNT, TG_MIME_TYPE, TG_PAGE_NUMBER, TG_PAGE_COUNT, TG_MIME_TYPE, TG_PAGE_NUMBER,
TG_CHUNK_INDEX, TG_CHAR_OFFSET, TG_CHAR_LENGTH, TG_CHUNK_INDEX, TG_CHAR_OFFSET, TG_CHAR_LENGTH,
TG_CHUNK_SIZE, TG_CHUNK_OVERLAP, TG_COMPONENT_VERSION, TG_CHUNK_SIZE, TG_CHUNK_OVERLAP, TG_COMPONENT_VERSION,
TG_LLM_MODEL, TG_ONTOLOGY, TG_REIFIES, TG_LLM_MODEL, TG_ONTOLOGY, TG_CONTAINS,
# Extraction provenance entity types
TG_DOCUMENT_TYPE, TG_PAGE_TYPE, TG_CHUNK_TYPE, TG_SUBGRAPH_TYPE,
# Query-time provenance predicates (GraphRAG) # Query-time provenance predicates (GraphRAG)
TG_QUERY, TG_EDGE_COUNT, TG_SELECTED_EDGE, TG_EDGE, TG_REASONING, TG_CONTENT, TG_QUERY, TG_EDGE_COUNT, TG_SELECTED_EDGE, TG_EDGE, TG_REASONING, TG_CONTENT,
TG_DOCUMENT, TG_DOCUMENT,
@ -28,7 +30,7 @@ from . namespaces import (
TG_GRAPH_RAG_QUESTION, TG_DOC_RAG_QUESTION, TG_GRAPH_RAG_QUESTION, TG_DOC_RAG_QUESTION,
) )
from . uris import activity_uri, agent_uri, edge_selection_uri from . uris import activity_uri, agent_uri, subgraph_uri, edge_selection_uri
def set_graph(triples: List[Triple], graph: str) -> List[Triple]: def set_graph(triples: List[Triple], graph: str) -> List[Triple]:
@ -92,6 +94,7 @@ def document_triples(
""" """
triples = [ triples = [
_triple(doc_uri, RDF_TYPE, _iri(PROV_ENTITY)), _triple(doc_uri, RDF_TYPE, _iri(PROV_ENTITY)),
_triple(doc_uri, RDF_TYPE, _iri(TG_DOCUMENT_TYPE)),
] ]
if title: if title:
@ -162,10 +165,23 @@ def derived_entity_triples(
act_uri = activity_uri() act_uri = activity_uri()
agt_uri = agent_uri(component_name) agt_uri = agent_uri(component_name)
# Determine specific type from parameters
if page_number is not None:
specific_type = TG_PAGE_TYPE
elif chunk_index is not None:
specific_type = TG_CHUNK_TYPE
else:
specific_type = None
triples = [ triples = [
# Entity declaration # Entity declaration
_triple(entity_uri, RDF_TYPE, _iri(PROV_ENTITY)), _triple(entity_uri, RDF_TYPE, _iri(PROV_ENTITY)),
]
if specific_type:
triples.append(_triple(entity_uri, RDF_TYPE, _iri(specific_type)))
triples.extend([
# Derivation from parent # Derivation from parent
_triple(entity_uri, PROV_WAS_DERIVED_FROM, _iri(parent_uri)), _triple(entity_uri, PROV_WAS_DERIVED_FROM, _iri(parent_uri)),
@ -183,7 +199,7 @@ def derived_entity_triples(
# Agent declaration # Agent declaration
_triple(agt_uri, RDF_TYPE, _iri(PROV_AGENT)), _triple(agt_uri, RDF_TYPE, _iri(PROV_AGENT)),
_triple(agt_uri, RDFS_LABEL, _literal(component_name)), _triple(agt_uri, RDFS_LABEL, _literal(component_name)),
] ])
if label: if label:
triples.append(_triple(entity_uri, RDFS_LABEL, _literal(label))) triples.append(_triple(entity_uri, RDFS_LABEL, _literal(label)))
@ -209,9 +225,9 @@ def derived_entity_triples(
return triples return triples
def triple_provenance_triples( def subgraph_provenance_triples(
stmt_uri: str, subgraph_uri: str,
extracted_triple: Triple, extracted_triples: List[Triple],
chunk_uri: str, chunk_uri: str,
component_name: str, component_name: str,
component_version: str, component_version: str,
@@ -220,16 +236,20 @@ def triple_provenance_triples(
     timestamp: Optional[str] = None,
 ) -> List[Triple]:
     """
-    Build provenance triples for an extracted knowledge triple using reification.
+    Build provenance triples for a subgraph of extracted knowledge.
+
+    One subgraph per chunk extraction, shared across all triples produced
+    from that chunk. This replaces per-triple reification with a
+    containment model.

     Creates:
-    - Reification triple: stmt_uri tg:reifies <<extracted_triple>>
-    - wasDerivedFrom link to source chunk
-    - Activity and agent metadata
+    - tg:contains link for each extracted triple (RDF-star quoted)
+    - One prov:wasDerivedFrom link to source chunk
+    - One activity with agent metadata

     Args:
-        stmt_uri: URI for the reified statement
-        extracted_triple: The extracted Triple to reify
+        subgraph_uri: URI for the extraction subgraph
+        extracted_triples: The extracted Triple objects to include
         chunk_uri: URI of source chunk
         component_name: Name of extractor component
         component_version: Version of the component
@@ -238,7 +258,7 @@ def triple_provenance_triples(
         timestamp: ISO timestamp

     Returns:
-        List of Triple objects for the provenance (including reification)
+        List of Triple objects for the provenance
     """
     if timestamp is None:
         timestamp = datetime.utcnow().isoformat() + "Z"
@@ -246,20 +266,23 @@ def triple_provenance_triples(
     act_uri = activity_uri()
     agt_uri = agent_uri(component_name)

-    # Create the quoted triple term (RDF-star reification)
-    triple_term = Term(type=TRIPLE, triple=extracted_triple)
-    triples = [
-        # Reification: stmt_uri tg:reifies <<s p o>>
-        Triple(
-            s=_iri(stmt_uri),
-            p=_iri(TG_REIFIES),
-            o=triple_term
-        ),
-        # Statement provenance
-        _triple(stmt_uri, PROV_WAS_DERIVED_FROM, _iri(chunk_uri)),
-        _triple(stmt_uri, PROV_WAS_GENERATED_BY, _iri(act_uri)),
+    triples = []
+
+    # Containment: subgraph tg:contains <<s p o>> for each extracted triple
+    for extracted_triple in extracted_triples:
+        triple_term = Term(type=TRIPLE, triple=extracted_triple)
+        triples.append(Triple(
+            s=_iri(subgraph_uri),
+            p=_iri(TG_CONTAINS),
+            o=triple_term
+        ))
+
+    # Subgraph provenance
+    triples.extend([
+        _triple(subgraph_uri, RDF_TYPE, _iri(PROV_ENTITY)),
+        _triple(subgraph_uri, RDF_TYPE, _iri(TG_SUBGRAPH_TYPE)),
+        _triple(subgraph_uri, PROV_WAS_DERIVED_FROM, _iri(chunk_uri)),
+        _triple(subgraph_uri, PROV_WAS_GENERATED_BY, _iri(act_uri)),
         # Activity
         _triple(act_uri, RDF_TYPE, _iri(PROV_ACTIVITY)),
@@ -272,7 +295,7 @@
         # Agent
         _triple(agt_uri, RDF_TYPE, _iri(PROV_AGENT)),
         _triple(agt_uri, RDFS_LABEL, _literal(component_name)),
-    ]
+    ])

     if llm_model:
         triples.append(_triple(act_uri, TG_LLM_MODEL, _literal(llm_model)))
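The containment model above can be sketched in isolation. This is an illustrative sketch only: plain tuples stand in for the repo's `Triple`/`Term` types, `"rdf:type"` abbreviates the full RDF IRI, and the activity/agent detail triples are elided. It shows why per-fact overhead is now constant: one `tg:contains` link per fact, plus a fixed block of subgraph metadata.

```python
TG = "https://trustgraph.ai/ns/"
PROV = "http://www.w3.org/ns/prov#"

def subgraph_provenance(subgraph_uri, extracted, chunk_uri, activity_uri):
    triples = []
    # One tg:contains link per extracted fact (the fact itself is the
    # quoted-triple object)
    for fact in extracted:
        triples.append((subgraph_uri, TG + "contains", fact))
    # Fixed per-subgraph metadata, emitted once regardless of fact count
    triples += [
        (subgraph_uri, "rdf:type", PROV + "Entity"),
        (subgraph_uri, "rdf:type", TG + "Subgraph"),
        (subgraph_uri, PROV + "wasDerivedFrom", chunk_uri),
        (subgraph_uri, PROV + "wasGeneratedBy", activity_uri),
    ]
    return triples

facts = [(f"urn:s{i}", "urn:p", "urn:o") for i in range(20)]
prov = subgraph_provenance("urn:sg", facts, "urn:chunk", "urn:act")
```

For 20 facts this yields 20 containment links plus 4 metadata triples; the old reified model would have repeated the metadata 20 times.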

View file

@@ -8,7 +8,7 @@ Child entities (pages, chunks) append path segments to the parent IRI:
 - Chunk: {page_iri}/c{chunk_index} (from page)
          {doc_iri}/c{chunk_index} (from text doc)
 - Activity: https://trustgraph.ai/activity/{uuid}
-- Statement: https://trustgraph.ai/stmt/{uuid}
+- Subgraph: https://trustgraph.ai/subgraph/{uuid}
 """

 import uuid
@@ -50,11 +50,11 @@ def activity_uri(activity_id: str = None) -> str:
     return f"{TRUSTGRAPH_BASE}/activity/{_encode_id(activity_id)}"


-def statement_uri(stmt_id: str = None) -> str:
-    """Generate URI for a reified statement. Auto-generates UUID if not provided."""
-    if stmt_id is None:
-        stmt_id = str(uuid.uuid4())
-    return f"{TRUSTGRAPH_BASE}/stmt/{_encode_id(stmt_id)}"
+def subgraph_uri(subgraph_id: str = None) -> str:
+    """Generate URI for an extraction subgraph. Auto-generates UUID if not provided."""
+    if subgraph_id is None:
+        subgraph_id = str(uuid.uuid4())
+    return f"{TRUSTGRAPH_BASE}/subgraph/{_encode_id(subgraph_id)}"


 def agent_uri(component_name: str) -> str:
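A minimal sketch of the renamed helper, showing the URI shape it produces. The real version also passes the id through `_encode_id` for URL safety, which is omitted here.

```python
import uuid

TRUSTGRAPH_BASE = "https://trustgraph.ai"

def subgraph_uri(subgraph_id=None):
    # Auto-generate a UUID when no explicit id is given, so each chunk
    # extraction gets a distinct subgraph URI
    if subgraph_id is None:
        subgraph_id = str(uuid.uuid4())
    return f"{TRUSTGRAPH_BASE}/subgraph/{subgraph_id}"
```

Calling it twice without an id yields two distinct URIs under the `/subgraph/` path, replacing the old `/stmt/` namespace.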

View file

@@ -19,11 +19,12 @@ from . namespaces import (
     SCHEMA_SUBJECT_OF, SCHEMA_DIGITAL_DOCUMENT, SCHEMA_DESCRIPTION,
     SCHEMA_KEYWORDS, SCHEMA_NAME,
     SKOS_DEFINITION,
-    TG_REIFIES, TG_PAGE_COUNT, TG_MIME_TYPE, TG_PAGE_NUMBER,
+    TG_CONTAINS, TG_PAGE_COUNT, TG_MIME_TYPE, TG_PAGE_NUMBER,
     TG_CHUNK_INDEX, TG_CHAR_OFFSET, TG_CHAR_LENGTH,
     TG_CHUNK_SIZE, TG_CHUNK_OVERLAP, TG_COMPONENT_VERSION,
     TG_LLM_MODEL, TG_ONTOLOGY, TG_EMBEDDING_MODEL,
     TG_SOURCE_TEXT, TG_SOURCE_CHAR_OFFSET, TG_SOURCE_CHAR_LENGTH,
+    TG_DOCUMENT_TYPE, TG_PAGE_TYPE, TG_CHUNK_TYPE, TG_SUBGRAPH_TYPE,
 )
@@ -74,9 +75,17 @@ SKOS_LABELS = [
     _label_triple(SKOS_DEFINITION, "definition"),
 ]

+# TrustGraph class labels (extraction provenance)
+TG_CLASS_LABELS = [
+    _label_triple(TG_DOCUMENT_TYPE, "Document"),
+    _label_triple(TG_PAGE_TYPE, "Page"),
+    _label_triple(TG_CHUNK_TYPE, "Chunk"),
+    _label_triple(TG_SUBGRAPH_TYPE, "Subgraph"),
+]
+
 # TrustGraph predicate labels
 TG_PREDICATE_LABELS = [
-    _label_triple(TG_REIFIES, "reifies"),
+    _label_triple(TG_CONTAINS, "contains"),
     _label_triple(TG_PAGE_COUNT, "page count"),
     _label_triple(TG_MIME_TYPE, "MIME type"),
     _label_triple(TG_PAGE_NUMBER, "page number"),
@@ -116,5 +125,6 @@ def get_vocabulary_triples() -> List[Triple]:
         DC_PREDICATE_LABELS +
         SCHEMA_LABELS +
         SKOS_LABELS +
+        TG_CLASS_LABELS +
         TG_PREDICATE_LABELS
     )

View file

@@ -96,7 +96,7 @@ tg-delete-config-item = "trustgraph.cli.delete_config_item:main"
 tg-list-collections = "trustgraph.cli.list_collections:main"
 tg-set-collection = "trustgraph.cli.set_collection:main"
 tg-delete-collection = "trustgraph.cli.delete_collection:main"
-tg-show-document-hierarchy = "trustgraph.cli.show_document_hierarchy:main"
+tg-show-extraction-provenance = "trustgraph.cli.show_extraction_provenance:main"
 tg-list-explain-traces = "trustgraph.cli.list_explain_traces:main"
 tg-show-explain-trace = "trustgraph.cli.show_explain_trace:main"

View file

@@ -36,7 +36,7 @@ TG_SELECTED_EDGE = TG + "selectedEdge"
 TG_EDGE = TG + "edge"
 TG_REASONING = TG + "reasoning"
 TG_CONTENT = TG + "content"
-TG_REIFIES = TG + "reifies"
+TG_CONTAINS = TG + "contains"

 PROV = "http://www.w3.org/ns/prov#"
 PROV_STARTED_AT_TIME = PROV + "startedAtTime"
 PROV_WAS_DERIVED_FROM = PROV + "wasDerivedFrom"
@@ -185,18 +185,18 @@ async def _query_edge_provenance(ws_url, flow_id, edge_s, edge_p, edge_o, user,
     """
     Query for provenance of an edge (s, p, o) in the knowledge graph.

-    Finds statements that reify the edge via tg:reifies, then follows
+    Finds subgraphs that contain the edge via tg:contains, then follows
     prov:wasDerivedFrom to find source documents.

     Returns list of source URIs (chunks, pages, documents).
     """
-    # Query for statements that reify this edge: ?stmt tg:reifies <<s p o>>
+    # Query for subgraphs that contain this edge: ?subgraph tg:contains <<s p o>>
     request = {
         "id": "edge-prov-request",
         "service": "triples",
         "flow": flow_id,
         "request": {
-            "p": {"t": "i", "i": TG_REIFIES},
+            "p": {"t": "i", "i": TG_CONTAINS},
             "o": {
                 "t": "t",  # Quoted triple type
                 "tr": {

View file

@@ -40,7 +40,7 @@ SOURCE_GRAPH = "urn:graph:source"

 # Provenance predicates for edge tracing
 TG = "https://trustgraph.ai/ns/"
-TG_REIFIES = TG + "reifies"
+TG_CONTAINS = TG + "contains"

 PROV = "http://www.w3.org/ns/prov#"
 PROV_WAS_DERIVED_FROM = PROV + "wasDerivedFrom"
@@ -79,10 +79,10 @@ def trace_edge_provenance(flow, user, collection, edge, label_cache, explain_cli
         }
     }

-    # Query: ?stmt tg:reifies <<edge>>
+    # Query: ?subgraph tg:contains <<edge>>
     try:
         results = flow.triples_query(
-            p=TG_REIFIES,
+            p=TG_CONTAINS,
             o=quoted_triple,
             g=SOURCE_GRAPH,
             user=user,
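The request shape used for this reverse lookup can be sketched as below. The helper name is hypothetical, and the encoding of `s`/`p`/`o` inside `"tr"` is simplified relative to the real typed-term dicts; the outer field names (`"p"`, `"o"`, `"t"`, `"i"`, `"tr"`) match those in the diff.

```python
TG_CONTAINS = "https://trustgraph.ai/ns/contains"

def edge_provenance_request(flow_id, s, p, o):
    return {
        "id": "edge-prov-request",
        "service": "triples",
        "flow": flow_id,
        "request": {
            # Predicate is the IRI tg:contains
            "p": {"t": "i", "i": TG_CONTAINS},
            # Object is a quoted-triple term ("t" type) wrapping the edge,
            # so the query reads: ?subgraph tg:contains <<s p o>>
            "o": {"t": "t", "tr": {"s": s, "p": p, "o": o}},
        },
    }

req = edge_provenance_request("flow-1", "urn:a", "urn:rel", "urn:b")
```

The subject is left unbound, so the store returns every subgraph containing the edge; following each subgraph's `prov:wasDerivedFrom` then reaches the source chunks.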

View file

@@ -1,12 +1,12 @@
 """
-Show document hierarchy: Document -> Pages -> Chunks -> Edges.
+Show extraction provenance: Document -> Pages -> Chunks -> Edges.

 Given a document ID, traverses and displays all derived entities
 (pages, chunks, extracted edges) using prov:wasDerivedFrom relationships.

 Examples:
-    tg-show-document-hierarchy -U trustgraph -C default "urn:trustgraph:doc:abc123"
-    tg-show-document-hierarchy --show-content --max-content 500 "urn:trustgraph:doc:abc123"
+    tg-show-extraction-provenance -U trustgraph -C default "urn:trustgraph:doc:abc123"
+    tg-show-extraction-provenance --show-content --max-content 500 "urn:trustgraph:doc:abc123"
 """

 import argparse
@@ -25,10 +25,22 @@ PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"
 RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
 RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
 TG = "https://trustgraph.ai/ns/"
-TG_REIFIES = TG + "reifies"
+TG_CONTAINS = TG + "contains"
+TG_DOCUMENT_TYPE = TG + "Document"
+TG_PAGE_TYPE = TG + "Page"
+TG_CHUNK_TYPE = TG + "Chunk"
+TG_SUBGRAPH_TYPE = TG + "Subgraph"
 DC_TITLE = "http://purl.org/dc/terms/title"
 DC_FORMAT = "http://purl.org/dc/terms/format"

+# Map TrustGraph type URIs to display names
+TYPE_MAP = {
+    TG_DOCUMENT_TYPE: "document",
+    TG_PAGE_TYPE: "page",
+    TG_CHUNK_TYPE: "chunk",
+    TG_SUBGRAPH_TYPE: "subgraph",
+}
+
 # Source graph
 SOURCE_GRAPH = "urn:graph:source"
@@ -109,15 +121,15 @@ def extract_value(term):

 def get_node_metadata(socket, flow_id, user, collection, node_uri):
-    """Get metadata for a node (label, type, title, format)."""
+    """Get metadata for a node (label, types, title, format)."""
     triples = query_triples(socket, flow_id, user, collection, s=node_uri, g=SOURCE_GRAPH)
-    metadata = {"uri": node_uri}
+    metadata = {"uri": node_uri, "types": []}
     for s, p, o in triples:
         if p == RDFS_LABEL:
             metadata["label"] = o
         elif p == RDF_TYPE:
-            metadata["type"] = o
+            metadata["types"].append(o)
         elif p == DC_TITLE:
             metadata["title"] = o
         elif p == DC_FORMAT:
@@ -126,6 +138,14 @@ def get_node_metadata(socket, flow_id, user, collection, node_uri):
     return metadata

+def classify_node(metadata):
+    """Classify a node based on its rdf:type values."""
+    for type_uri in metadata.get("types", []):
+        if type_uri in TYPE_MAP:
+            return TYPE_MAP[type_uri]
+    return "unknown"
+
 def get_children(socket, flow_id, user, collection, parent_uri):
     """Get children of a node via prov:wasDerivedFrom."""
     triples = query_triples(
@@ -135,29 +155,6 @@ def get_children(socket, flow_id, user, collection, parent_uri):
     return [s for s, p, o in triples]
-def get_edges_from_chunk(socket, flow_id, user, collection, chunk_uri):
-    """Get edges that were derived from a chunk (via tg:reifies)."""
-    # Query for triples where: ?stmt prov:wasDerivedFrom chunk_uri
-    # Then get the tg:reifies value
-    derived_triples = query_triples(
-        socket, flow_id, user, collection,
-        p=PROV_WAS_DERIVED_FROM, o=chunk_uri, g=SOURCE_GRAPH
-    )
-    edges = []
-    for stmt_uri, _, _ in derived_triples:
-        # Get what this statement reifies
-        reifies_triples = query_triples(
-            socket, flow_id, user, collection,
-            s=stmt_uri, p=TG_REIFIES, g=SOURCE_GRAPH
-        )
-        for _, _, edge in reifies_triples:
-            if isinstance(edge, dict):
-                edges.append(edge)
-    return edges
-
 def get_document_content(api, user, doc_id, max_content):
     """Fetch document content from librarian API."""
     try:
@@ -176,32 +173,6 @@ def get_document_content(api, user, doc_id, max_content):
         return f"[Error fetching content: {e}]"

-def classify_uri(uri):
-    """Classify a URI as document, page, or chunk based on patterns."""
-    if not isinstance(uri, str):
-        return "unknown"
-    # Common patterns in trustgraph URIs
-    if "/c" in uri and uri.split("/c")[-1].isdigit():
-        return "chunk"
-    if "/p" in uri and any(uri.split("/p")[-1].replace("/", "").isdigit() for _ in [1]):
-        # Check for page pattern like /p1 or /p1/
-        parts = uri.split("/p")
-        if len(parts) > 1:
-            remainder = parts[-1].split("/")[0]
-            if remainder.isdigit():
-                return "page"
-    if "chunk" in uri.lower():
-        return "chunk"
-    if "page" in uri.lower():
-        return "page"
-    if "doc" in uri.lower():
-        return "document"
-    return "unknown"
-
 def build_hierarchy(socket, flow_id, user, collection, root_uri, api=None, show_content=False, max_content=200, visited=None):
     """Build document hierarchy tree recursively."""
     if visited is None:
@@ -212,7 +183,7 @@ def build_hierarchy(socket, flow_id, user, collection, root_uri, api=None, show_
     visited.add(root_uri)

     metadata = get_node_metadata(socket, flow_id, user, collection, root_uri)
-    node_type = classify_uri(root_uri)
+    node_type = classify_node(metadata)

     node = {
         "uri": root_uri,
@@ -232,10 +203,20 @@ def build_hierarchy(socket, flow_id, user, collection, root_uri, api=None, show_
     children_uris = get_children(socket, flow_id, user, collection, root_uri)

     for child_uri in children_uris:
-        child_type = classify_uri(child_uri)
+        child_metadata = get_node_metadata(socket, flow_id, user, collection, child_uri)
+        child_type = classify_node(child_metadata)

-        # Recursively build hierarchy for pages and chunks
-        if child_type in ("page", "chunk", "unknown"):
+        if child_type == "subgraph":
+            # Subgraphs contain extracted edges — inline them
+            contains_triples = query_triples(
+                socket, flow_id, user, collection,
+                s=child_uri, p=TG_CONTAINS, g=SOURCE_GRAPH
+            )
+            for _, _, edge in contains_triples:
+                if isinstance(edge, dict):
+                    node["edges"].append(edge)
+        else:
+            # Recurse into pages, chunks, etc.
             child_node = build_hierarchy(
                 socket, flow_id, user, collection, child_uri,
                 api=api, show_content=show_content, max_content=max_content,
@@ -244,11 +225,6 @@ def build_hierarchy(socket, flow_id, user, collection, root_uri, api=None, show_
             if child_node:
                 node["children"].append(child_node)

-    # Get edges for chunks
-    if node_type == "chunk":
-        edges = get_edges_from_chunk(socket, flow_id, user, collection, root_uri)
-        node["edges"] = edges
-
     # Sort children by URI for consistent output
     node["children"].sort(key=lambda x: x.get("uri", ""))
@@ -332,7 +308,7 @@ def print_json(node):

 def main():
     parser = argparse.ArgumentParser(
-        prog='tg-show-document-hierarchy',
+        prog='tg-show-extraction-provenance',
         description=__doc__,
         formatter_class=argparse.RawDescriptionHelpFormatter,
     )
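The new type-based classification replaces the old URI-pattern guessing. A self-contained sketch of the `classify_node`/`TYPE_MAP` pair added above (the metadata dict shape follows `get_node_metadata`; nodes may carry extra types such as `prov:Entity`, which are simply skipped):

```python
TG = "https://trustgraph.ai/ns/"

# Map TrustGraph type URIs to display names
TYPE_MAP = {
    TG + "Document": "document",
    TG + "Page": "page",
    TG + "Chunk": "chunk",
    TG + "Subgraph": "subgraph",
}

def classify_node(metadata):
    # First recognised rdf:type wins; unrecognised nodes fall back to "unknown"
    for type_uri in metadata.get("types", []):
        if type_uri in TYPE_MAP:
            return TYPE_MAP[type_uri]
    return "unknown"
```

Because classification reads declared `rdf:type` values rather than parsing URI substrings, entities with unconventional identifiers are no longer misclassified.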

View file

@@ -11,6 +11,8 @@ from ....rdf import TRUSTGRAPH_ENTITIES, RDF_LABEL, SUBJECT_OF, DEFINITION
 from ....base import FlowProcessor, ConsumerSpec, ProducerSpec
 from ....base import AgentClientSpec
+from ....provenance import subgraph_uri, subgraph_provenance_triples, set_graph, GRAPH_SOURCE
+from ....flow_version import __version__ as COMPONENT_VERSION
 from ....template import PromptManager

 # Module logger
@@ -196,9 +198,21 @@ class Processor(FlowProcessor):
             return

         # Process extraction data
-        triples, entity_contexts = self.process_extraction_data(
-            extraction_data, v.metadata
-        )
+        triples, entity_contexts, extracted_triples = \
+            self.process_extraction_data(extraction_data, v.metadata)
+
+        # Generate subgraph provenance for extracted triples
+        if extracted_triples:
+            chunk_uri = v.metadata.id
+            sg_uri = subgraph_uri()
+            prov_triples = subgraph_provenance_triples(
+                subgraph_uri=sg_uri,
+                extracted_triples=extracted_triples,
+                chunk_uri=chunk_uri,
+                component_name=default_ident,
+                component_version=COMPONENT_VERSION,
+            )
+            triples.extend(set_graph(prov_triples, GRAPH_SOURCE))

         # Emit outputs
         if triples:
@@ -221,8 +235,13 @@ class Processor(FlowProcessor):
         Data is a flat list of objects with 'type' discriminator field:
         - {"type": "definition", "entity": "...", "definition": "..."}
         - {"type": "relationship", "subject": "...", "predicate": "...", "object": "...", "object-entity": bool}
+
+        Returns:
+            Tuple of (all_triples, entity_contexts, extracted_triples) where
+            extracted_triples contains only the core knowledge facts (for provenance).
         """
         triples = []
+        extracted_triples = []
         entity_contexts = []

         # Categorize items by type
@@ -242,11 +261,13 @@ class Processor(FlowProcessor):
             ))

             # Add definition
-            triples.append(Triple(
+            definition_triple = Triple(
                 s = Term(type=IRI, iri=entity_uri),
                 p = Term(type=IRI, iri=DEFINITION),
                 o = Term(type=LITERAL, value=defn["definition"]),
-            ))
+            )
+            triples.append(definition_triple)
+            extracted_triples.append(definition_triple)

             # Add subject-of relationship to document
             if metadata.id:
@@ -298,11 +319,13 @@ class Processor(FlowProcessor):
             ))

             # Add the main relationship triple
-            triples.append(Triple(
+            relationship_triple = Triple(
                 s = subject_value,
                 p = predicate_value,
                 o = object_value
-            ))
+            )
+            triples.append(relationship_triple)
+            extracted_triples.append(relationship_triple)

             # Add subject-of relationships to document
             if metadata.id:
@@ -325,7 +348,7 @@ class Processor(FlowProcessor):
                 o = Term(type=IRI, iri=metadata.id),
             ))

-        return triples, entity_contexts
+        return triples, entity_contexts, extracted_triples

     @staticmethod
     def add_args(parser):

View file

@@ -20,7 +20,7 @@ from .... rdf import TRUSTGRAPH_ENTITIES, DEFINITION, RDF_LABEL, SUBJECT_OF
 from .... base import FlowProcessor, ConsumerSpec, ProducerSpec
 from .... base import PromptClientSpec, ParameterSpec
-from .... provenance import statement_uri, triple_provenance_triples, set_graph, GRAPH_SOURCE
+from .... provenance import subgraph_uri, subgraph_provenance_triples, set_graph, GRAPH_SOURCE
 from .... flow_version import __version__ as COMPONENT_VERSION

 DEFINITION_VALUE = Term(type=IRI, iri=DEFINITION)
@@ -133,6 +133,7 @@ class Processor(FlowProcessor):
             raise e

         triples = []
+        extracted_triples = []
         entities = []

         # Get chunk document ID for provenance linking
@@ -173,20 +174,7 @@ class Processor(FlowProcessor):
                 s=s_value, p=DEFINITION_VALUE, o=o_value
             )
             triples.append(definition_triple)
-
-            # Generate provenance for the definition triple (reification)
-            # Provenance triples go in the source graph for separation from core knowledge
-            stmt_uri = statement_uri()
-            prov_triples = triple_provenance_triples(
-                stmt_uri=stmt_uri,
-                extracted_triple=definition_triple,
-                chunk_uri=chunk_uri,
-                component_name=default_ident,
-                component_version=COMPONENT_VERSION,
-                llm_model=llm_model,
-                ontology_uri=ontology_uri,
-            )
-            triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
+            extracted_triples.append(definition_triple)

             # Link entity to chunk (not top-level document)
             triples.append(Triple(
@@ -211,6 +199,20 @@ class Processor(FlowProcessor):
                 chunk_id=chunk_doc_id,
             ))

+        # Generate subgraph provenance once for all extracted triples
+        if extracted_triples:
+            sg_uri = subgraph_uri()
+            prov_triples = subgraph_provenance_triples(
+                subgraph_uri=sg_uri,
+                extracted_triples=extracted_triples,
+                chunk_uri=chunk_uri,
+                component_name=default_ident,
+                component_version=COMPONENT_VERSION,
+                llm_model=llm_model,
+                ontology_uri=ontology_uri,
+            )
+            triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
+
         # Send triples in batches
         for i in range(0, len(triples), self.triples_batch_size):
             batch = triples[i:i + self.triples_batch_size]

View file

@@ -23,6 +23,9 @@ from .ontology_selector import OntologySelector, OntologySubset
 from .simplified_parser import parse_extraction_response
 from .triple_converter import TripleConverter

+from .... provenance import subgraph_uri, subgraph_provenance_triples, set_graph, GRAPH_SOURCE
+from .... flow_version import __version__ as COMPONENT_VERSION
+
 logger = logging.getLogger(__name__)

 default_ident = "kg-extract-ontology"
@@ -306,11 +309,25 @@ class Processor(FlowProcessor):
             flow, chunk, ontology_subset, prompt_variables
         )

+        # Generate subgraph provenance for extracted triples
+        if triples:
+            chunk_uri = v.metadata.id
+            sg_uri = subgraph_uri()
+            prov_triples = subgraph_provenance_triples(
+                subgraph_uri=sg_uri,
+                extracted_triples=triples,
+                chunk_uri=chunk_uri,
+                component_name=default_ident,
+                component_version=COMPONENT_VERSION,
+            )
+
         # Generate ontology definition triples
         ontology_triples = self.build_ontology_triples(ontology_subset)

-        # Combine extracted triples with ontology triples
+        # Combine extracted triples with ontology triples and provenance
         all_triples = triples + ontology_triples
+        if triples:
+            all_triples.extend(set_graph(prov_triples, GRAPH_SOURCE))

         # Build entity contexts from all triples (including ontology elements)
         entity_contexts = self.build_entity_contexts(all_triples)

View file

@@ -20,7 +20,7 @@ from .... rdf import RDF_LABEL, TRUSTGRAPH_ENTITIES, SUBJECT_OF
 from .... base import FlowProcessor, ConsumerSpec, ProducerSpec
 from .... base import PromptClientSpec, ParameterSpec
-from .... provenance import statement_uri, triple_provenance_triples, set_graph, GRAPH_SOURCE
+from .... provenance import subgraph_uri, subgraph_provenance_triples, set_graph, GRAPH_SOURCE
 from .... flow_version import __version__ as COMPONENT_VERSION

 RDF_LABEL_VALUE = Term(type=IRI, iri=RDF_LABEL)
@@ -115,6 +115,7 @@ class Processor(FlowProcessor):
             raise e

         triples = []
+        extracted_triples = []

         # Get chunk document ID for provenance linking
         chunk_doc_id = v.document_id if v.document_id else v.metadata.id
@@ -160,20 +161,7 @@ class Processor(FlowProcessor):
                 o=o_value
             )
             triples.append(relationship_triple)
-
-            # Generate provenance for the relationship triple (reification)
-            # Provenance triples go in the source graph for separation from core knowledge
-            stmt_uri = statement_uri()
-            prov_triples = triple_provenance_triples(
-                stmt_uri=stmt_uri,
-                extracted_triple=relationship_triple,
-                chunk_uri=chunk_uri,
-                component_name=default_ident,
-                component_version=COMPONENT_VERSION,
-                llm_model=llm_model,
-                ontology_uri=ontology_uri,
-            )
-            triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
+            extracted_triples.append(relationship_triple)

             # Label for s
             triples.append(Triple(
@@ -212,6 +200,20 @@ class Processor(FlowProcessor):
                 o=Term(type=IRI, iri=chunk_uri)
             ))

+        # Generate subgraph provenance once for all extracted triples
+        if extracted_triples:
+            sg_uri = subgraph_uri()
+            prov_triples = subgraph_provenance_triples(
+                subgraph_uri=sg_uri,
+                extracted_triples=extracted_triples,
+                chunk_uri=chunk_uri,
+                component_name=default_ident,
+                component_version=COMPONENT_VERSION,
+                llm_model=llm_model,
+                ontology_uri=ontology_uri,
+            )
+            triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
+
         # Send triples in batches
         for i in range(0, len(triples), self.triples_batch_size):
             batch = triples[i:i + self.triples_batch_size]
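The commit message's overhead numbers can be checked with simple arithmetic, using its stated figures: roughly 13 provenance triples per fact under per-triple reification, versus one `tg:contains` link per fact plus a shared block of roughly 13 subgraph/activity/agent triples under the new model.

```python
def reified_overhead(n_facts, per_fact=13):
    # Old model: full reification (statement URI, activity, agent) per fact
    return n_facts * per_fact

def subgraph_overhead(n_facts, shared=13):
    # New model: one tg:contains link per fact plus shared subgraph metadata
    return n_facts + shared
```

For a chunk yielding 20 extractions this gives 260 versus 33 provenance triples, matching the figures in the commit message; the old model grows linearly with a large constant, the new one linearly with a constant of one.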