Subgraph provenance (#694)

Replace per-triple provenance reification with subgraph model

Extraction provenance previously created a full reification (statement
URI, activity, agent) for every single extracted triple, producing ~13
provenance triples per knowledge triple.  Since each chunk is processed
by a single LLM call, this was both redundant and semantically
inaccurate.

Now one subgraph object is created per chunk extraction, with
tg:contains linking it to each extracted triple.  For a chunk that
yields 20 extracted triples, this reduces provenance from ~260 triples
to ~33.
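
The reduction can be checked with simple arithmetic.  This is a minimal
sketch, assuming the subgraph model costs roughly 13 fixed provenance
triples per chunk plus one tg:contains triple per extracted triple.
That split is inferred from the ~33 figure, not stated by the code, and
the function names are hypothetical.

```python
# ~13 provenance triples per knowledge triple (figure from this message).
PER_TRIPLE_OVERHEAD = 13

# Assumption: one subgraph per chunk carries a similar fixed overhead.
SUBGRAPH_FIXED_OVERHEAD = 13


def reified_cost(n_triples: int) -> int:
    """Provenance triples under the old per-triple reification model."""
    return PER_TRIPLE_OVERHEAD * n_triples


def subgraph_cost(n_triples: int) -> int:
    """Provenance triples under the subgraph model: a fixed overhead
    plus one tg:contains link per extracted triple."""
    return SUBGRAPH_FIXED_OVERHEAD + n_triples


print(reified_cost(20), subgraph_cost(20))
```

Under these assumptions the old model grows by about 13 triples for
each extracted triple, while the new one grows by only one.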

- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to
  generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent
  (previously had none)
- Update CLI tools and tech specs to match
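
As an illustration of generating provenance once per chunk, here is a
rough sketch.  It is not the actual subgraph_provenance_triples()
implementation: the function name, its signature, and the use of plain
tuples are all hypothetical, and the activity and agent triples are
omitted.  Linking the subgraph to its chunk via prov:wasDerivedFrom is
also an assumption, based on the derivation chain described in the spec.

```python
def sketch_subgraph_provenance(subgraph_uri, chunk_uri, extracted):
    """Return illustrative provenance triples for one chunk extraction.

    `extracted` is a list of (s, p, o) tuples produced from the chunk.
    """
    # One link from the subgraph back to its source chunk (assumed).
    triples = [(subgraph_uri, "prov:wasDerivedFrom", chunk_uri)]

    # One tg:contains link per extracted triple; the triple itself is
    # the object, standing in for a quoted triple <<s p o>>.
    for triple in extracted:
        triples.append((subgraph_uri, "tg:contains", triple))

    return triples


example = sketch_subgraph_provenance(
    "urn:example:subgraph-1",
    "urn:example:chunk-1",
    [("ex:s1", "ex:p", "ex:o1"), ("ex:s2", "ex:p", "ex:o2")],
)
```

The provenance is built in a single call per chunk, so per-extraction
metadata is emitted once however many triples the chunk yields.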

Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.

Add extra typing for extraction provenance and fix the extraction
provenance CLI.
cybermaggedon 2026-03-13 11:37:59 +00:00 committed by GitHub
parent 35128ff019
commit 64e3f6bd0d
20 changed files with 463 additions and 193 deletions

@@ -193,7 +193,7 @@ When storing explainability data, URIs from `uri_map` are used.
 Selected edges can be traced back to source documents:
-1. Query for reifying statement: `?stmt tg:reifies <<s p o>>`
+1. Query for containing subgraph: `?subgraph tg:contains <<s p o>>`
 2. Follow `prov:wasDerivedFrom` chain to root document
 3. Each step in chain: chunk → page → document
@@ -209,7 +209,7 @@ elif term.type == TRIPLE:
 This enables queries like:
 ```
-?stmt tg:reifies <<http://example.org/s http://example.org/p "value">>
+?subgraph tg:contains <<http://example.org/s http://example.org/p "value">>
 ```
 ## CLI Usage