Subgraph provenance (#694)

Replace per-triple provenance reification with subgraph model

Extraction provenance previously created a full reification (statement
URI, activity, agent) for every single extracted triple, producing ~13
provenance triples per knowledge triple.  Since each chunk is processed
by a single LLM call, this was both redundant and semantically
inaccurate.

Now one subgraph object is created per chunk extraction, with
tg:contains linking to each extracted triple.  For 20 extractions from
a chunk this reduces provenance from ~260 triples to ~33.

- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to
  generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent
  (previously had none)
- Update CLI tools and tech specs to match

Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.

Added extra typing for extraction provenance, fixed extraction prov CLI
cybermaggedon 2026-03-13 11:37:59 +00:00 committed by GitHub
parent 35128ff019
commit 64e3f6bd0d
20 changed files with 463 additions and 193 deletions


@@ -0,0 +1,205 @@
# Extraction Provenance: Subgraph Model
## Problem
Extraction-time provenance currently generates a full reification per
extracted triple: a unique `stmt_uri`, `activity_uri`, and associated
PROV-O metadata for every single knowledge fact. Processing one chunk
that yields 20 relationships produces ~260 provenance triples on top of
the ~20 knowledge triples, a roughly 13:1 overhead.
This is both expensive (storage, indexing, transmission) and semantically
inaccurate. Each chunk is processed by a single LLM call that produces
all its triples in one transaction. The current per-triple model
obscures that by creating the illusion of 20 independent extraction
events.
Additionally, two of the four extraction processors (kg-extract-ontology,
kg-extract-agent) have no provenance at all, leaving gaps in the audit
trail.
## Solution
Replace per-triple reification with a **subgraph model**: one provenance
record per chunk extraction, shared across all triples produced from that
chunk.
### Terminology Change
| Old | New |
|-----|-----|
| `stmt_uri` (`https://trustgraph.ai/stmt/{uuid}`) | `subgraph_uri` (`https://trustgraph.ai/subgraph/{uuid}`) |
| `statement_uri()` | `subgraph_uri()` |
| `tg:reifies` (1:1, identity) | `tg:contains` (1:many, containment) |
### Target Structure
All provenance triples go in the `urn:graph:source` named graph.
```
# Subgraph contains each extracted triple (RDF-star quoted triples)
<subgraph> tg:contains <<s1 p1 o1>> .
<subgraph> tg:contains <<s2 p2 o2>> .
<subgraph> tg:contains <<s3 p3 o3>> .
# Derivation from source chunk
<subgraph> prov:wasDerivedFrom <chunk_uri> .
<subgraph> prov:wasGeneratedBy <activity> .
# Activity: one per chunk extraction
<activity> rdf:type prov:Activity .
<activity> rdfs:label "{component_name} extraction" .
<activity> prov:used <chunk_uri> .
<activity> prov:wasAssociatedWith <agent> .
<activity> prov:startedAtTime "2026-03-13T10:00:00Z" .
<activity> tg:componentVersion "0.25.0" .
<activity> tg:llmModel "gpt-4" . # if available
<activity> tg:ontology <ontology_uri> . # if available
# Agent: stable per component
<agent> rdf:type prov:Agent .
<agent> rdfs:label "{component_name}" .
```
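With this structure in place, tracing an extracted triple back to its origin is a two-hop lookup. An informal query sketch in the same style as the examples above (variable names illustrative):

```
# Find the subgraph that contains a given triple, then its source and activity
?subgraph tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
?subgraph prov:wasDerivedFrom ?chunk .
?subgraph prov:wasGeneratedBy ?activity .
?activity tg:llmModel ?model .
```

Because every triple from the same chunk shares one subgraph, this lookup also yields all sibling triples produced by the same LLM call.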
### Volume Comparison
For a chunk producing N extracted triples:

| | Old (per-triple) | New (subgraph) |
|---|---|---|
| `tg:contains` / `tg:reifies` | N | N |
| Activity triples | ~9 x N | ~9 |
| Agent triples | 2 x N | 2 |
| Statement/subgraph metadata | 2 x N | 2 |
| **Total provenance triples** | **~13N** | **N + 13** |
| **Example (N=20)** | **~260** | **33** |
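The totals in the table follow directly from the per-triple vs shared breakdown; a quick sanity check (illustrative only — the constant 13 folds together the activity, agent, and subgraph metadata triples):

```python
def old_provenance_count(n: int) -> int:
    # Per-triple model: each of the n triples carries its own reification
    # link (1), activity block (~9), agent (2), and statement metadata (2).
    return 13 * n

def new_provenance_count(n: int) -> int:
    # Subgraph model: n tg:contains links plus one shared block of ~13.
    return n + 13

print(old_provenance_count(20))  # 260
print(new_provenance_count(20))  # 33
```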
## Scope
### Processors to Update (existing provenance, per-triple)
**kg-extract-definitions**
(`trustgraph-flow/trustgraph/extract/kg/definitions/extract.py`)
Currently calls `statement_uri()` + `triple_provenance_triples()` inside
the per-definition loop.
Changes:
- Move `subgraph_uri()` and `activity_uri()` creation before the loop
- Collect `tg:contains` triples inside the loop
- Emit shared activity/agent/derivation block once after the loop
**kg-extract-relationships**
(`trustgraph-flow/trustgraph/extract/kg/relationships/extract.py`)
Same pattern as definitions. Same changes.
### Processors to Add Provenance (currently missing)
**kg-extract-ontology**
(`trustgraph-flow/trustgraph/extract/kg/ontology/extract.py`)
Currently emits triples with no provenance. Add subgraph provenance
using the same pattern: one subgraph per chunk, `tg:contains` for each
extracted triple.
**kg-extract-agent**
(`trustgraph-flow/trustgraph/extract/kg/agent/extract.py`)
Currently emits triples with no provenance. Add subgraph provenance
using the same pattern.
### Shared Provenance Library Changes
**`trustgraph-base/trustgraph/provenance/triples.py`**
- Replace `triple_provenance_triples()` with `subgraph_provenance_triples()`
- New function accepts a list of extracted triples instead of a single one
- Generates one `tg:contains` per triple, shared activity/agent block
- Remove old `triple_provenance_triples()`
**`trustgraph-base/trustgraph/provenance/uris.py`**
- Replace `statement_uri()` with `subgraph_uri()`
**`trustgraph-base/trustgraph/provenance/namespaces.py`**
- Replace `TG_REIFIES` with `TG_CONTAINS`
### Not in Scope
- **kg-extract-topics**: older-style processor, not currently used in
standard flows
- **kg-extract-rows**: produces rows not triples, different provenance
model
- **Query-time provenance** (`urn:graph:retrieval`): separate concern,
already uses a different pattern (question/exploration/focus/synthesis)
- **Document/page/chunk provenance** (PDF decoder, chunker): already uses
`derived_entity_triples()` which is per-entity, not per-triple — no
redundancy issue
## Implementation Notes
### Processor Loop Restructure
Before (per-triple, in relationships):
```python
for rel in rels:
    # ... build relationship_triple ...
    stmt_uri = statement_uri()
    prov_triples = triple_provenance_triples(
        stmt_uri=stmt_uri,
        extracted_triple=relationship_triple,
        ...
    )
    triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```
After (subgraph):
```python
sg_uri = subgraph_uri()
extracted_triples = []
for rel in rels:
    # ... build relationship_triple ...
    extracted_triples.append(relationship_triple)
prov_triples = subgraph_provenance_triples(
    subgraph_uri=sg_uri,
    extracted_triples=extracted_triples,
    chunk_uri=chunk_uri,
    component_name=default_ident,
    component_version=COMPONENT_VERSION,
    llm_model=llm_model,
    ontology_uri=ontology_uri,
)
triples.extend(set_graph(prov_triples, GRAPH_SOURCE))
```
### New Helper Signature
```python
def subgraph_provenance_triples(
    subgraph_uri: str,
    extracted_triples: List[Triple],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    """
    Build provenance triples for a subgraph of extracted knowledge.

    Creates:
    - tg:contains link for each extracted triple (RDF-star quoted)
    - One prov:wasDerivedFrom link to source chunk
    - One activity with agent metadata
    """
```
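A minimal sketch of what the helper body might look like. This assumes triples are plain (subject, predicate, object) string tuples and mints activity/agent URIs by analogy with the documented subgraph URI pattern; the real module's types, namespaces, and URI schemes may differ:

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Hypothetical lightweight triple representation, for illustration only.
Triple = Tuple[str, str, str]

def make_subgraph_uri() -> str:
    # Mirrors the documented pattern https://trustgraph.ai/subgraph/{uuid}
    return f"https://trustgraph.ai/subgraph/{uuid.uuid4()}"

def quote(t: Triple) -> str:
    # RDF-star quoted-triple rendering for the object position
    return f"<< {t[0]} {t[1]} {t[2]} >>"

def subgraph_provenance_triples(
    subgraph_uri: str,
    extracted_triples: List[Triple],
    chunk_uri: str,
    component_name: str,
    component_version: str,
    llm_model: Optional[str] = None,
    ontology_uri: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> List[Triple]:
    # Activity/agent URI patterns below are assumptions, not the real ones.
    activity = f"https://trustgraph.ai/activity/{uuid.uuid4()}"
    agent = f"https://trustgraph.ai/agent/{component_name}"
    ts = timestamp or datetime.now(timezone.utc).isoformat()

    # One tg:contains link per extracted triple...
    out: List[Triple] = [
        (subgraph_uri, "tg:contains", quote(t)) for t in extracted_triples
    ]
    # ...plus one shared derivation/activity/agent block per chunk.
    out += [
        (subgraph_uri, "prov:wasDerivedFrom", chunk_uri),
        (subgraph_uri, "prov:wasGeneratedBy", activity),
        (activity, "rdf:type", "prov:Activity"),
        (activity, "rdfs:label", f"{component_name} extraction"),
        (activity, "prov:used", chunk_uri),
        (activity, "prov:wasAssociatedWith", agent),
        (activity, "prov:startedAtTime", ts),
        (activity, "tg:componentVersion", component_version),
        (agent, "rdf:type", "prov:Agent"),
        (agent, "rdfs:label", component_name),
    ]
    if llm_model:
        out.append((activity, "tg:llmModel", llm_model))
    if ontology_uri:
        out.append((activity, "tg:ontology", ontology_uri))
    return out
```

For N extracted triples with both optional fields set, this sketch yields N + 12 triples, in line with the ~N + 13 figure above (the exact constant depends on how many activity fields are populated).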
### Breaking Change
This is a breaking change to the provenance model. Provenance has not
been released, so no migration is needed. The old `tg:reifies` /
`statement_uri` code can be removed outright.


@@ -311,10 +311,10 @@ activity:chunk-789 tg:chunkOverlap 200 .
 # The extracted triple (edge)
 entity:JohnSmith rel:worksAt entity:AcmeCorp .
-# Statement object pointing at the edge (RDF 1.2 reification)
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 prov:wasGeneratedBy activity:extract-999 .
+# Subgraph containing the extracted triples
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 prov:wasGeneratedBy activity:extract-999 .
 activity:extract-999 a prov:Activity .
 activity:extract-999 prov:used chunk:123-1-1 .
@@ -344,7 +344,7 @@ Custom predicates under the `tg:` namespace for extraction-specific metadata:
 | Predicate | Domain | Description |
 |-----------|--------|-------------|
-| `tg:reifies` | Statement | Points at the triple this statement object represents |
+| `tg:contains` | Subgraph | Points at a triple contained in this extraction subgraph |
 | `tg:pageCount` | Document | Total number of pages in source document |
 | `tg:mimeType` | Document | MIME type of source document |
 | `tg:pageNumber` | Page | Page number in source document |
@@ -383,7 +383,7 @@ prov:startedAtTime rdfs:label "started at" .
 **TrustGraph Predicates:**
 ```
-tg:reifies rdfs:label "reifies" .
+tg:contains rdfs:label "contains" .
 tg:pageCount rdfs:label "page count" .
 tg:mimeType rdfs:label "MIME type" .
 tg:pageNumber rdfs:label "page number" .
@@ -416,20 +416,20 @@ For finer-grained provenance, it would be valuable to record exactly where withi
 # The extracted triple
 entity:JohnSmith rel:worksAt entity:AcmeCorp .
-# Statement with sub-chunk provenance
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
-stmt:001 tg:sourceCharOffset 1547 .
-stmt:001 tg:sourceCharLength 46 .
+# Subgraph with sub-chunk provenance
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:sourceCharOffset 1547 .
+subgraph:001 tg:sourceCharLength 46 .
 ```
 **Example with text range (alternative):**
 ```
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceRange "1547-1593" .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceRange "1547-1593" .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
 ```
**Implementation considerations:**


@@ -193,7 +193,7 @@ When storing explainability data, URIs from `uri_map` are used.
 Selected edges can be traced back to source documents:
-1. Query for reifying statement: `?stmt tg:reifies <<s p o>>`
+1. Query for containing subgraph: `?subgraph tg:contains <<s p o>>`
 2. Follow `prov:wasDerivedFrom` chain to root document
 3. Each step in chain: chunk → page → document
@@ -209,7 +209,7 @@ elif term.type == TRIPLE:
 This enables queries like:
 ```
-?stmt tg:reifies <<http://example.org/s http://example.org/p "value">>
+?subgraph tg:contains <<http://example.org/s http://example.org/p "value">>
 ```
## CLI Usage