---
layout: default
title: "Extraction Provenance: Subgraph Model"
parent: "Tech Specs"
---

# Extraction Provenance: Subgraph Model

## Problem

Extraction-time provenance currently generates a full reification per
extracted triple: a unique `stmt_uri`, `activity_uri`, and associated
PROV-O metadata for every single knowledge fact. Processing one chunk
that yields 20 relationships produces ~220 provenance triples on top of
the ~20 knowledge triples, a roughly 10:1 overhead.

This is both expensive (storage, indexing, transmission) and semantically
inaccurate. Each chunk is processed by a single LLM call that produces
all its triples in one transaction. The current per-triple model
obscures that by creating the illusion of 20 independent extraction
events.

Additionally, two of the four extraction processors (kg-extract-ontology,
kg-extract-agent) have no provenance at all, leaving gaps in the audit
trail.

## Solution

Replace per-triple reification with a **subgraph model**: one provenance
record per chunk extraction, shared across all triples produced from that
chunk.

### Terminology Change

| Old | New |
|-----|-----|
| `stmt_uri` (`https://trustgraph.ai/stmt/{uuid}`) | `subgraph_uri` (`https://trustgraph.ai/subgraph/{uuid}`) |
| `statement_uri()` | `subgraph_uri()` |
| `tg:reifies` (1:1, identity) | `tg:contains` (1:many, containment) |

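A minimal sketch of the renamed URI helper, assuming it simply mints a random UUID under the new prefix. The URI pattern comes from the table above; the use of `uuid.uuid4()` and the `SUBGRAPH_PREFIX` constant name are assumptions, not confirmed details of `uris.py`.

```python
import uuid

# Prefix taken from the terminology table; the constant name is illustrative.
SUBGRAPH_PREFIX = "https://trustgraph.ai/subgraph/"


def subgraph_uri() -> str:
    """Mint a fresh subgraph URI (assumes a random UUID4 suffix)."""
    return f"{SUBGRAPH_PREFIX}{uuid.uuid4()}"


uri = subgraph_uri()
print(uri)
```

One URI is minted per chunk extraction, so every `tg:contains` triple for that chunk shares the same subject.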
### Target Structure

All provenance triples go in the `urn:graph:source` named graph.

```
# Subgraph contains each extracted triple (RDF-star quoted triples)
<subgraph> tg:contains <<s1 p1 o1>> .
<subgraph> tg:contains <<s2 p2 o2>> .
<subgraph> tg:contains <<s3 p3 o3>> .

# Derivation from source chunk
<subgraph> prov:wasDerivedFrom <chunk_uri> .
<subgraph> prov:wasGeneratedBy <activity> .

# Activity: one per chunk extraction
<activity> rdf:type prov:Activity .
<activity> rdfs:label "{component_name} extraction" .
<activity> prov:used <chunk_uri> .
<activity> prov:wasAssociatedWith <agent> .
<activity> prov:startedAtTime "2026-03-13T10:00:00Z" .
<activity> tg:componentVersion "0.25.0" .
<activity> tg:llmModel "gpt-4" .          # if available
<activity> tg:ontology <ontology_uri> .   # if available

# Agent: stable per component
<agent> rdf:type prov:Agent .
<agent> rdfs:label "{component_name}" .
```

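With this structure, tracing an extracted fact back to its source chunk is a two-hop lookup: fact to subgraph via `tg:contains`, then subgraph to chunk via `prov:wasDerivedFrom`. The sketch below illustrates that over an in-memory list. The tuple-based `Triple` type, the `source_chunk` helper, and the sample URIs are all hypothetical stand-ins, not TrustGraph APIs.

```python
from typing import List, Optional, Tuple, Union

# Stand-in types for illustration only; the real Triple class differs.
Fact = Tuple[str, str, str]
Triple = Tuple[str, str, Union[str, Fact]]


def source_chunk(prov: List[Triple], fact: Fact) -> Optional[str]:
    """Return the chunk URI that a quoted fact was extracted from, if any."""
    # Hop 1: find the subgraph that contains the quoted triple.
    subgraph = next(
        (s for s, p, o in prov if p == "tg:contains" and o == fact), None
    )
    if subgraph is None:
        return None
    # Hop 2: follow the subgraph's derivation link to the chunk.
    return next(
        (o for s, p, o in prov
         if s == subgraph and p == "prov:wasDerivedFrom" and isinstance(o, str)),
        None,
    )


prov = [
    ("sg:1", "tg:contains", ("ex:alice", "ex:knows", "ex:bob")),
    ("sg:1", "tg:contains", ("ex:bob", "ex:worksAt", "ex:acme")),
    ("sg:1", "prov:wasDerivedFrom", "chunk:42"),
    ("sg:1", "prov:wasGeneratedBy", "activity:7"),
]

print(source_chunk(prov, ("ex:bob", "ex:worksAt", "ex:acme")))  # chunk:42
```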
### Volume Comparison

For a chunk producing N extracted triples:

| Provenance triples | Old (per-triple) | New (subgraph) |
|---|---|---|
| `tg:contains` / `tg:reifies` | N | N |
| Activity triples | ~9 x N | ~9 |
| Agent triples | 2 x N | 2 |
| Statement/subgraph metadata | 2 x N | 2 |
| **Total provenance triples** | **~13N** | **N + 13** |
| **Example (N=20)** | **~260** | **33** |

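The totals can be turned into a quick calculation for other chunk sizes. This sketch uses the table's approximate totals (~13N and N + 13) exactly as given; `provenance_volume` is a hypothetical helper, not part of the codebase.

```python
def provenance_volume(n: int) -> dict:
    """Approximate provenance triple counts for a chunk yielding n triples."""
    old = 13 * n   # the table's "~13N" total for the per-triple model
    new = n + 13   # the table's "N + 13" total for the subgraph model
    return {"old": old, "new": new}


for n in (5, 20, 100):
    counts = provenance_volume(n)
    print(n, counts["old"], counts["new"])
```

The old model grows linearly with a factor of about 13, while the new model adds a fixed overhead that is amortized across every triple in the chunk.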
## Scope

### Processors to Update (existing provenance, per-triple)

**kg-extract-definitions**
(`trustgraph-flow/trustgraph/extract/kg/definitions/extract.py`)

Currently calls `statement_uri()` + `triple_provenance_triples()` inside
the per-definition loop.

Changes:
- Move `subgraph_uri()` and `activity_uri()` creation before the loop
- Collect `tg:contains` triples inside the loop
- Emit shared activity/agent/derivation block once after the loop

**kg-extract-relationships**
(`trustgraph-flow/trustgraph/extract/kg/relationships/extract.py`)

Same pattern as definitions. Same changes.

### Processors to Add Provenance (currently missing)

**kg-extract-ontology**
(`trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`)

Currently emits triples with no provenance. Add subgraph provenance
using the same pattern: one subgraph per chunk, `tg:contains` for each
extracted triple.

**kg-extract-agent**
(`trustgraph-flow/trustgraph/extract/kg/agent/extract.py`)

Currently emits triples with no provenance. Add subgraph provenance
using the same pattern.

### Shared Provenance Library Changes

**`trustgraph-base/trustgraph/provenance/triples.py`**

- Replace `triple_provenance_triples()` with `subgraph_provenance_triples()`
- New function accepts a list of extracted triples instead of a single one
- Generates one `tg:contains` per triple, shared activity/agent block
- Remove old `triple_provenance_triples()`

**`trustgraph-base/trustgraph/provenance/uris.py`**

- Replace `statement_uri()` with `subgraph_uri()`

**`trustgraph-base/trustgraph/provenance/namespaces.py`**

- Replace `TG_REIFIES` with `TG_CONTAINS`

### Not in Scope

- **kg-extract-topics**: older-style processor, not currently used in
  standard flows
- **kg-extract-rows**: produces rows not triples, different provenance
  model
- **Query-time provenance** (`urn:graph:retrieval`): separate concern,
  already uses a different pattern (question/exploration/focus/synthesis)
- **Document/page/chunk provenance** (PDF decoder, chunker): already uses
  `derived_entity_triples()` which is per-entity, not per-triple, so
  there is no redundancy issue

## Implementation Notes

### Processor Loop Restructure

Before (per-triple, in relationships):
```python
for rel in rels:
    # ... build relationship_triple ...
    stmt_uri = statement_uri()
    prov_triples = triple_provenance_triples(
        stmt_uri=stmt_uri,
        extracted_triple=relationship_triple,
        ...
    )
    triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```

After (subgraph):
```python
sg_uri = subgraph_uri()
extracted_triples = []  # collected in the loop, passed to the helper once

for rel in rels:
    # ... build relationship_triple ...
    extracted_triples.append(relationship_triple)

prov_triples = subgraph_provenance_triples(
    subgraph_uri=sg_uri,
    extracted_triples=extracted_triples,
    chunk_uri=chunk_uri,
    component_name=default_ident,
    component_version=COMPONENT_VERSION,
    llm_model=llm_model,
    ontology_uri=ontology_uri,
)
triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```

### New Helper Signature

```python
from typing import List, Optional

# Triple is the project's existing triple type.


def subgraph_provenance_triples(
    subgraph_uri: str,
    extracted_triples: List[Triple],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    """
    Build provenance triples for a subgraph of extracted knowledge.

    Creates:
    - tg:contains link for each extracted triple (RDF-star quoted)
    - One prov:wasDerivedFrom link to source chunk
    - One activity with agent metadata
    """
```

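One way the helper body could look, as a self-contained sketch rather than the actual implementation. It mirrors the Target Structure block using plain tuples. The tuple-based `Triple` stand-in, the `_sketch` function name, the `urn:example:` URI schemes for the activity and agent, and the default timestamp behaviour are all assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional, Tuple, Union

# Illustrative stand-ins: the real Triple type and URI helpers live in
# trustgraph-base and are not reproduced here.
Fact = Tuple[str, str, str]
Triple = Tuple[str, str, Union[str, Fact]]


def subgraph_provenance_triples_sketch(
    subgraph_uri: str,
    extracted_triples: List[Fact],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    # Hypothetical URI schemes for the activity and agent nodes.
    activity = f"urn:example:activity:{uuid.uuid4()}"
    agent = f"urn:example:agent:{component_name}"
    ts = timestamp or datetime.now(timezone.utc).isoformat()

    # One containment link per extracted triple (quoted triple as object).
    out: List[Triple] = [
        (subgraph_uri, "tg:contains", t) for t in extracted_triples
    ]

    # Shared block, emitted once per chunk extraction.
    out += [
        (subgraph_uri, "prov:wasDerivedFrom", chunk_uri),
        (subgraph_uri, "prov:wasGeneratedBy", activity),
        (activity, "rdf:type", "prov:Activity"),
        (activity, "rdfs:label", f"{component_name} extraction"),
        (activity, "prov:used", chunk_uri),
        (activity, "prov:wasAssociatedWith", agent),
        (activity, "prov:startedAtTime", ts),
        (activity, "tg:componentVersion", component_version),
        (agent, "rdf:type", "prov:Agent"),
        (agent, "rdfs:label", component_name),
    ]

    # Optional metadata, only when available.
    if llm_model:
        out.append((activity, "tg:llmModel", llm_model))
    if ontology_uri:
        out.append((activity, "tg:ontology", ontology_uri))
    return out


facts = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:q", "ex:c")]
prov = subgraph_provenance_triples_sketch(
    "https://trustgraph.ai/subgraph/demo", facts, "chunk:1",
    "kg-extract-relationships", "0.25.0", llm_model="gpt-4",
)
print(len(prov))  # 2 containment links + 10 shared + 1 optional = 13
```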
### Breaking Change

This is a breaking change to the provenance model. Provenance has not
been released, so no migration is needed. The old `tg:reifies` /
`statement_uri` code can be removed outright.

## Vocabulary Reference

The full OWL ontology covering all extraction and query-time classes and predicates is at `specs/ontology/trustgraph.ttl`.