mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Subgraph provenance (#694)
Replace per-triple provenance reification with subgraph model

Extraction provenance previously created a full reification (statement URI, activity, agent) for every single extracted triple, producing ~13 provenance triples per knowledge triple. Since each chunk is processed by a single LLM call, this was both redundant and semantically inaccurate.

Now one subgraph object is created per chunk extraction, with tg:contains linking to each extracted triple. For 20 extractions from a chunk this reduces provenance from ~260 triples to ~33.

- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent (previously had none)
- Update CLI tools and tech specs to match

Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.

Added extra typing for extraction provenance, fixed extraction prov CLI
parent 35128ff019
commit 64e3f6bd0d
20 changed files with 463 additions and 193 deletions
docs/tech-specs/extraction-provenance-subgraph.md (Normal file, 205 changed lines)
@@ -0,0 +1,205 @@
# Extraction Provenance: Subgraph Model

## Problem

Extraction-time provenance currently generates a full reification per
extracted triple: a unique `stmt_uri`, `activity_uri`, and associated
PROV-O metadata for every single knowledge fact. Processing one chunk
that yields 20 relationships produces ~260 provenance triples on top of
the ~20 knowledge triples, a roughly 13:1 overhead.

This is both expensive (storage, indexing, transmission) and semantically
inaccurate. Each chunk is processed by a single LLM call that produces
all its triples in one transaction. The current per-triple model
obscures that by creating the illusion of 20 independent extraction
events.

Additionally, two of the four extraction processors (kg-extract-ontology,
kg-extract-agent) have no provenance at all, leaving gaps in the audit
trail.

## Solution

Replace per-triple reification with a **subgraph model**: one provenance
record per chunk extraction, shared across all triples produced from that
chunk.

### Terminology Change

| Old | New |
|-----|-----|
| `stmt_uri` (`https://trustgraph.ai/stmt/{uuid}`) | `subgraph_uri` (`https://trustgraph.ai/subgraph/{uuid}`) |
| `statement_uri()` | `subgraph_uri()` |
| `tg:reifies` (1:1, identity) | `tg:contains` (1:many, containment) |
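
As a rough illustration of the renamed URI helper, the sketch below mints a subgraph URI in the form shown in the table. Only the URI prefix and the function name come from this spec; generating the identifier with `uuid.uuid4()` is an assumption, and the real helper in `uris.py` may differ.

```python
import uuid

# Prefix taken from the terminology table above.
SUBGRAPH_PREFIX = "https://trustgraph.ai/subgraph/"


def subgraph_uri() -> str:
    """Return a fresh subgraph URI (assumes a random UUID4 suffix)."""
    return SUBGRAPH_PREFIX + str(uuid.uuid4())


if __name__ == "__main__":
    # One URI is minted per chunk extraction, not per extracted triple.
    print(subgraph_uri())
```

The point of the sketch is the call site rather than the body: the function is called once per chunk, so every triple from that chunk shares the same subject in its `tg:contains` link.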

### Target Structure

All provenance triples go in the `urn:graph:source` named graph.

```
# Subgraph contains each extracted triple (RDF-star quoted triples)
<subgraph> tg:contains <<s1 p1 o1>> .
<subgraph> tg:contains <<s2 p2 o2>> .
<subgraph> tg:contains <<s3 p3 o3>> .

# Derivation from source chunk
<subgraph> prov:wasDerivedFrom <chunk_uri> .
<subgraph> prov:wasGeneratedBy <activity> .

# Activity: one per chunk extraction
<activity> rdf:type prov:Activity .
<activity> rdfs:label "{component_name} extraction" .
<activity> prov:used <chunk_uri> .
<activity> prov:wasAssociatedWith <agent> .
<activity> prov:startedAtTime "2026-03-13T10:00:00Z" .
<activity> tg:componentVersion "0.25.0" .
<activity> tg:llmModel "gpt-4" . # if available
<activity> tg:ontology <ontology_uri> . # if available

# Agent: stable per component
<agent> rdf:type prov:Agent .
<agent> rdfs:label "{component_name}" .
```

### Volume Comparison

For a chunk producing N extracted triples:

| | Old (per-triple) | New (subgraph) |
|---|---|---|
| `tg:contains` / `tg:reifies` | N | N |
| Activity triples | ~9 x N | ~9 |
| Agent triples | 2 x N | 2 |
| Statement/subgraph metadata | 2 x N | 2 |
| **Total provenance triples** | **~13N** | **N + 13** |
| **Example (N=20)** | **~260** | **33** |
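
The totals in the table can be checked with a few lines of arithmetic. The sketch below takes the table's approximate figures at face value (about 13 provenance triples per fact in the old model, and about 13 shared triples in the new one); the function names are made up for illustration.

```python
def old_provenance_count(n: int) -> int:
    """Per-triple model: roughly 13 provenance triples for each extracted fact."""
    return 13 * n


def new_provenance_count(n: int) -> int:
    """Subgraph model: one tg:contains per fact plus a shared block."""
    shared = 9 + 2 + 2  # activity + agent + subgraph metadata (approximate)
    return n + shared


if __name__ == "__main__":
    for n in (1, 2, 20, 100):
        print(n, old_provenance_count(n), new_provenance_count(n))
```

Under these approximations the subgraph model is smaller for any chunk that yields two or more triples; only a single-triple chunk costs one triple more than before.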

## Scope

### Processors to Update (existing provenance, per-triple)

**kg-extract-definitions**
(`trustgraph-flow/trustgraph/extract/kg/definitions/extract.py`)

Currently calls `statement_uri()` + `triple_provenance_triples()` inside
the per-definition loop.

Changes:
- Move `subgraph_uri()` and `activity_uri()` creation before the loop
- Collect `tg:contains` triples inside the loop
- Emit shared activity/agent/derivation block once after the loop

**kg-extract-relationships**
(`trustgraph-flow/trustgraph/extract/kg/relationships/extract.py`)

Same pattern as definitions. Same changes.

### Processors to Add Provenance (currently missing)

**kg-extract-ontology**
(`trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`)

Currently emits triples with no provenance. Add subgraph provenance
using the same pattern: one subgraph per chunk, `tg:contains` for each
extracted triple.

**kg-extract-agent**
(`trustgraph-flow/trustgraph/extract/kg/agent/extract.py`)

Currently emits triples with no provenance. Add subgraph provenance
using the same pattern.

### Shared Provenance Library Changes

**`trustgraph-base/trustgraph/provenance/triples.py`**

- Replace `triple_provenance_triples()` with `subgraph_provenance_triples()`
- New function accepts a list of extracted triples instead of a single one
- Generates one `tg:contains` per triple, shared activity/agent block
- Remove old `triple_provenance_triples()`

**`trustgraph-base/trustgraph/provenance/uris.py`**

- Replace `statement_uri()` with `subgraph_uri()`

**`trustgraph-base/trustgraph/provenance/namespaces.py`**

- Replace `TG_REIFIES` with `TG_CONTAINS`

### Not in Scope

- **kg-extract-topics**: older-style processor, not currently used in
  standard flows
- **kg-extract-rows**: produces rows not triples, different provenance
  model
- **Query-time provenance** (`urn:graph:retrieval`): separate concern,
  already uses a different pattern (question/exploration/focus/synthesis)
- **Document/page/chunk provenance** (PDF decoder, chunker): already uses
  `derived_entity_triples()` which is per-entity, not per-triple, so there
  is no redundancy issue

## Implementation Notes

### Processor Loop Restructure

Before (per-triple, in relationships):
```python
for rel in rels:
    # ... build relationship_triple ...
    stmt_uri = statement_uri()
    prov_triples = triple_provenance_triples(
        stmt_uri=stmt_uri,
        extracted_triple=relationship_triple,
        ...
    )
    triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```

After (subgraph):
```python
sg_uri = subgraph_uri()
extracted_triples = []  # collected across the loop, shared by one subgraph

for rel in rels:
    # ... build relationship_triple ...
    extracted_triples.append(relationship_triple)

# Provenance is generated once per chunk, after the loop
prov_triples = subgraph_provenance_triples(
    subgraph_uri=sg_uri,
    extracted_triples=extracted_triples,
    chunk_uri=chunk_uri,
    component_name=default_ident,
    component_version=COMPONENT_VERSION,
    llm_model=llm_model,
    ontology_uri=ontology_uri,
)
triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```

### New Helper Signature

```python
from typing import List, Optional

# Triple is TrustGraph's existing triple type (import omitted here)


def subgraph_provenance_triples(
    subgraph_uri: str,
    extracted_triples: List[Triple],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    """
    Build provenance triples for a subgraph of extracted knowledge.

    Creates:
    - tg:contains link for each extracted triple (RDF-star quoted)
    - One prov:wasDerivedFrom link to source chunk
    - One activity with agent metadata
    """
```
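
To make the shape of the output concrete, here is a minimal sketch of a body for this helper. It is not the TrustGraph implementation: triples are modelled as plain tuples rather than the project's `Triple` type, predicates are written as prefixed strings, and the activity and agent URI schemes (`urn:example:...`) are hypothetical placeholders. Only the parameter list and the emitted predicates follow the spec above.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional, Tuple, Union

# Stand-in types for illustration only; not TrustGraph's Triple class.
Fact = Tuple[str, str, str]
Node = Union[str, Fact]  # a URI/literal, or a quoted triple
Triple = Tuple[str, str, Node]


def subgraph_provenance_triples(
    subgraph_uri: str,
    extracted_triples: List[Fact],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    # Hypothetical URI schemes; the real activity/agent URIs are not given here.
    activity = f"urn:example:activity:{uuid.uuid4()}"
    agent = f"urn:example:agent:{component_name}"
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    # One tg:contains link per extracted triple (the triple is the object).
    out: List[Triple] = [(subgraph_uri, "tg:contains", t) for t in extracted_triples]

    # Shared derivation/activity/agent block, emitted once per chunk.
    out += [
        (subgraph_uri, "prov:wasDerivedFrom", chunk_uri),
        (subgraph_uri, "prov:wasGeneratedBy", activity),
        (activity, "rdf:type", "prov:Activity"),
        (activity, "rdfs:label", f"{component_name} extraction"),
        (activity, "prov:used", chunk_uri),
        (activity, "prov:wasAssociatedWith", agent),
        (activity, "prov:startedAtTime", timestamp),
        (activity, "tg:componentVersion", component_version),
        (agent, "rdf:type", "prov:Agent"),
        (agent, "rdfs:label", component_name),
    ]
    # Optional metadata is only emitted when the caller supplies it.
    if llm_model is not None:
        out.append((activity, "tg:llmModel", llm_model))
    if ontology_uri is not None:
        out.append((activity, "tg:ontology", ontology_uri))
    return out
```

In this sketch the shared block has a fixed size regardless of how many triples the chunk produced, which is where the saving over the per-triple model comes from.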

### Breaking Change

This is a breaking change to the provenance model. Provenance has not
been released, so no migration is needed. The old `tg:reifies` /
`statement_uri` code can be removed outright.

@@ -311,10 +311,10 @@ activity:chunk-789 tg:chunkOverlap 200 .
 # The extracted triple (edge)
 entity:JohnSmith rel:worksAt entity:AcmeCorp .

-# Statement object pointing at the edge (RDF 1.2 reification)
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 prov:wasGeneratedBy activity:extract-999 .
+# Subgraph containing the extracted triples
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 prov:wasGeneratedBy activity:extract-999 .

 activity:extract-999 a prov:Activity .
 activity:extract-999 prov:used chunk:123-1-1 .

@@ -344,7 +344,7 @@ Custom predicates under the `tg:` namespace for extraction-specific metadata:

 | Predicate | Domain | Description |
 |-----------|--------|-------------|
-| `tg:reifies` | Statement | Points at the triple this statement object represents |
+| `tg:contains` | Subgraph | Points at a triple contained in this extraction subgraph |
 | `tg:pageCount` | Document | Total number of pages in source document |
 | `tg:mimeType` | Document | MIME type of source document |
 | `tg:pageNumber` | Page | Page number in source document |

@@ -383,7 +383,7 @@ prov:startedAtTime rdfs:label "started at" .

 **TrustGraph Predicates:**
 ```
-tg:reifies rdfs:label "reifies" .
+tg:contains rdfs:label "contains" .
 tg:pageCount rdfs:label "page count" .
 tg:mimeType rdfs:label "MIME type" .
 tg:pageNumber rdfs:label "page number" .

@@ -416,20 +416,20 @@ For finer-grained provenance, it would be valuable to record exactly where withi
 # The extracted triple
 entity:JohnSmith rel:worksAt entity:AcmeCorp .

-# Statement with sub-chunk provenance
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
-stmt:001 tg:sourceCharOffset 1547 .
-stmt:001 tg:sourceCharLength 46 .
+# Subgraph with sub-chunk provenance
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:sourceCharOffset 1547 .
+subgraph:001 tg:sourceCharLength 46 .
 ```

 **Example with text range (alternative):**
 ```
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceRange "1547-1593" .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceRange "1547-1593" .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
 ```

 **Implementation considerations:**

@@ -193,7 +193,7 @@ When storing explainability data, URIs from `uri_map` are used.

 Selected edges can be traced back to source documents:

-1. Query for reifying statement: `?stmt tg:reifies <<s p o>>`
+1. Query for containing subgraph: `?subgraph tg:contains <<s p o>>`
 2. Follow `prov:wasDerivedFrom` chain to root document
 3. Each step in chain: chunk → page → document

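
The tracing steps in the hunk above can be sketched against an in-memory stand-in for the graph. This is an illustration only: the two dictionaries replace real `tg:contains` and `prov:wasDerivedFrom` queries, the function name is made up, and the page and document identifiers in the example are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Fact = Tuple[str, str, str]


def trace_to_root(
    fact: Fact,
    contains: Dict[str, List[Fact]],
    derived_from: Dict[str, str],
) -> Optional[List[str]]:
    """Return the chain subgraph -> chunk -> page -> document, or None."""
    # Step 1: find the subgraph whose tg:contains links include the fact.
    subgraph = next((sg for sg, facts in contains.items() if fact in facts), None)
    if subgraph is None:
        return None

    # Steps 2-3: follow prov:wasDerivedFrom until no parent remains.
    chain = [subgraph]
    seen = {subgraph}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:  # guard against cycles in malformed data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


if __name__ == "__main__":
    edge = ("entity:JohnSmith", "rel:worksAt", "entity:AcmeCorp")
    contains = {"subgraph:001": [edge]}
    derived_from = {
        "subgraph:001": "chunk:123-1-1",
        "chunk:123-1-1": "page:123-1",  # hypothetical page identifier
        "page:123-1": "doc:123",        # hypothetical document identifier
    }
    print(trace_to_root(edge, contains, derived_from))
```

Because one subgraph now covers every triple from a chunk, the first lookup returns the same subgraph for all sibling triples, and the rest of the chain is shared.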

@@ -209,7 +209,7 @@ elif term.type == TRIPLE:

 This enables queries like:
 ```
-?stmt tg:reifies <<http://example.org/s http://example.org/p "value">>
+?subgraph tg:contains <<http://example.org/s http://example.org/p "value">>
 ```

 ## CLI Usage