mirror of https://github.com/trustgraph-ai/trustgraph.git (synced 2026-04-26 00:46:22 +02:00)
Subgraph provenance (#694)
Replace per-triple provenance reification with subgraph model

Extraction provenance previously created a full reification (statement URI, activity, agent) for every single extracted triple, producing ~13 provenance triples per knowledge triple. Since each chunk is processed by a single LLM call, this was both redundant and semantically inaccurate.

Now one subgraph object is created per chunk extraction, with tg:contains linking to each extracted triple. For 20 extractions from a chunk this reduces provenance from ~260 triples to ~33.

- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent (previously had none)
- Update CLI tools and tech specs to match

Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.

Added extra typing for extraction provenance, fixed extraction prov CLI
parent 35128ff019
commit 64e3f6bd0d
20 changed files with 463 additions and 193 deletions
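
The triple-count reduction quoted in the commit message can be checked with a few lines of arithmetic. This is only a sketch: the per-triple cost of 13 comes from the message itself, while the fixed per-subgraph overhead of 13 is an assumption, inferred from "~33" minus the 20 `tg:contains` links. The function names are hypothetical and do not appear in the repository.

```python
# Rough cost model for extraction provenance, before and after this commit.
# Both constants are approximations taken or inferred from the commit message.
PER_TRIPLE_REIFICATION_COST = 13  # "~13 provenance triples per knowledge triple"
SUBGRAPH_FIXED_COST = 13          # assumption: ~33 total minus 20 tg:contains links


def reified_provenance_count(n_triples: int) -> int:
    """Old model: a full reification for every extracted triple."""
    return n_triples * PER_TRIPLE_REIFICATION_COST


def subgraph_provenance_count(n_triples: int) -> int:
    """New model: one subgraph per chunk plus one tg:contains per triple."""
    return SUBGRAPH_FIXED_COST + n_triples


if __name__ == "__main__":
    n = 20
    print(reified_provenance_count(n))   # 260
    print(subgraph_provenance_count(n))  # 33
```

Under this model the old cost grows by 13 triples per extraction, while the new cost grows by 1, so the saving increases with the number of triples extracted from a chunk.
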
@@ -311,10 +311,10 @@ activity:chunk-789 tg:chunkOverlap 200 .
 # The extracted triple (edge)
 entity:JohnSmith rel:worksAt entity:AcmeCorp .
 
-# Statement object pointing at the edge (RDF 1.2 reification)
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 prov:wasGeneratedBy activity:extract-999 .
+# Subgraph containing the extracted triples
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 prov:wasGeneratedBy activity:extract-999 .
 
 activity:extract-999 a prov:Activity .
 activity:extract-999 prov:used chunk:123-1-1 .
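
The subgraph pattern in the hunk above can be sketched in Python as a function that emits one set of provenance triples per chunk. This is a hypothetical illustration, not the repository's `subgraph_provenance_triples()`: the function name, its parameters, and the plain-string triple representation are all assumptions, and it covers only the predicates visible in this hunk.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def quoted(triple: Triple) -> str:
    """Render a triple in the <<s p o>> quoted form used in the spec example."""
    s, p, o = triple
    return f"<<{s} {p} {o}>>"


def subgraph_provenance_sketch(
    subgraph_uri: str,
    chunk_uri: str,
    activity_uri: str,
    extracted: List[Triple],
) -> List[Triple]:
    """Hypothetical helper: one subgraph per chunk, one tg:contains per triple."""
    # One link per extracted triple.
    prov: List[Triple] = [
        (subgraph_uri, "tg:contains", quoted(t)) for t in extracted
    ]
    # Emitted once per chunk, regardless of how many triples were extracted.
    prov.append((subgraph_uri, "prov:wasDerivedFrom", chunk_uri))
    prov.append((subgraph_uri, "prov:wasGeneratedBy", activity_uri))
    prov.append((activity_uri, "a", "prov:Activity"))
    prov.append((activity_uri, "prov:used", chunk_uri))
    return prov


if __name__ == "__main__":
    triples = [("entity:JohnSmith", "rel:worksAt", "entity:AcmeCorp")]
    for s, p, o in subgraph_provenance_sketch(
        "subgraph:001", "chunk:123-1-1", "activity:extract-999", triples
    ):
        print(f"{s} {p} {o} .")
```

Only the `tg:contains` lines scale with the number of extracted triples; the derivation and activity triples are written once per chunk.
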
@@ -344,7 +344,7 @@ Custom predicates under the `tg:` namespace for extraction-specific metadata:
 
 | Predicate | Domain | Description |
 |-----------|--------|-------------|
-| `tg:reifies` | Statement | Points at the triple this statement object represents |
+| `tg:contains` | Subgraph | Points at a triple contained in this extraction subgraph |
 | `tg:pageCount` | Document | Total number of pages in source document |
 | `tg:mimeType` | Document | MIME type of source document |
 | `tg:pageNumber` | Page | Page number in source document |
@@ -383,7 +383,7 @@ prov:startedAtTime rdfs:label "started at" .
 
 **TrustGraph Predicates:**
 ```
-tg:reifies rdfs:label "reifies" .
+tg:contains rdfs:label "contains" .
 tg:pageCount rdfs:label "page count" .
 tg:mimeType rdfs:label "MIME type" .
 tg:pageNumber rdfs:label "page number" .
@@ -416,20 +416,20 @@ For finer-grained provenance, it would be valuable to record exactly where withi
 # The extracted triple
 entity:JohnSmith rel:worksAt entity:AcmeCorp .
 
-# Statement with sub-chunk provenance
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
-stmt:001 tg:sourceCharOffset 1547 .
-stmt:001 tg:sourceCharLength 46 .
+# Subgraph with sub-chunk provenance
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:sourceCharOffset 1547 .
+subgraph:001 tg:sourceCharLength 46 .
 ```
 
 **Example with text range (alternative):**
 ```
-stmt:001 tg:reifies <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
-stmt:001 prov:wasDerivedFrom chunk:123-1-1 .
-stmt:001 tg:sourceRange "1547-1593" .
-stmt:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
+subgraph:001 tg:contains <<entity:JohnSmith rel:worksAt entity:AcmeCorp>> .
+subgraph:001 prov:wasDerivedFrom chunk:123-1-1 .
+subgraph:001 tg:sourceRange "1547-1593" .
+subgraph:001 tg:sourceText "John Smith has worked at Acme Corp since 2019" .
 ```
 
 **Implementation considerations:**
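
The two sub-chunk encodings in the hunk above (`tg:sourceCharOffset` plus `tg:sourceCharLength`, versus a single `tg:sourceRange` string) carry the same information if the range is read as `start-end` with `end = offset + length`. The spec text shown here does not state that convention; it is an assumption that happens to match the example values (1547 + 46 = 1593). The helper names below are hypothetical.

```python
from typing import Tuple


def to_source_range(offset: int, length: int) -> str:
    """Encode offset/length as a "start-end" string (end assumed exclusive)."""
    return f"{offset}-{offset + length}"


def from_source_range(source_range: str) -> Tuple[int, int]:
    """Decode a "start-end" string back into (offset, length)."""
    start_text, end_text = source_range.split("-")
    start, end = int(start_text), int(end_text)
    return start, end - start


if __name__ == "__main__":
    print(to_source_range(1547, 46))       # 1547-1593
    print(from_source_range("1547-1593"))  # (1547, 46)
```

Keeping a single conversion point like this would let a reader or tool accept either encoding without storing both on every subgraph.
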