This document captures notes on extraction-time provenance for future specification work. Extraction-time provenance records the "source layer" - where data came from originally and how it was extracted and transformed.
This is separate from query-time provenance (see `query-time-provenance.md`) which records agent reasoning.
### Current Approach
- Document metadata is stored as RDF triples in the knowledge graph
- A document ID ties metadata to the document, so the document appears as a node in the graph
- When edges (relationships/facts) are extracted from documents, a `subjectOf` relationship links the extracted edge back to the source document
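The current shape can be sketched with triples modeled as plain `(subject, predicate, object)` tuples. The identifiers and predicate names below are illustrative assumptions, not the actual TrustGraph schema:

```python
# Minimal sketch of the current approach: metadata triples attached to
# a document node, plus a subjectOf link from an extracted edge.
# All names here are made up for illustration.
graph = set()

doc = "doc:quarterly-report"

# Document metadata stored as RDF triples against the document node
graph.add((doc, "rdfs:label", "Quarterly report"))
graph.add((doc, "tg:publishedBy", "Acme Corp"))

# Each extracted edge links back to the source document via subjectOf
fact = "fact:acme-revenue-2024"
graph.add((fact, "tg:subjectOf", doc))
```

Because the document is itself a node, source attribution is a single hop - but only to the top-level document, which is the limitation discussed below.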
### Problems with Current Approach
1. **Repetitive metadata loading:** Document metadata is bundled and loaded repeatedly with every batch of triples extracted from that document. This is wasteful and redundant - the same metadata travels as cargo with every extraction output.
2. **Shallow provenance:** The current `subjectOf` relationship only links facts directly to the top-level document. There is no visibility into the transformation chain - which page the fact came from, which chunk, what extraction method was used.
### Desired State
1. **Load metadata once:** Document metadata should be loaded once and attached to the top-level document node, not repeated with every triple batch.
2. **Rich provenance DAG:** Capture the full transformation chain from source document through all intermediate artifacts down to extracted facts. For example, a PDF document transformation:
```
PDF file (source document with metadata)
  → Page 1 (decoded text)
      → Chunk 1
          → Extracted edge/fact (via subjectOf)
          → Extracted edge/fact
      → Chunk 2
          → Extracted edge/fact
  → Page 2
      → Chunk 3
          → ...
```
3. **Unified storage:** The provenance DAG is stored in the same knowledge graph as the extracted knowledge. This allows provenance to be queried the same way as knowledge - following edges back up the chain from any fact to its exact source location.
4. **Stable IDs:** Each intermediate artifact (page, chunk) has a stable ID as a node in the graph.
5. **Parent-child linking:** Derived documents are linked to their parents all the way up to the top-level source document using consistent relationship types.
6. **Precise fact attribution:** The `subjectOf` relationship on extracted edges points to the immediate parent (chunk), not the top-level document. Full provenance is recovered by traversing up the DAG.
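The upward traversal can be sketched as follows. The `tg:parent` predicate name is an assumption (these notes only fix `subjectOf`); the graph is again modeled as a set of tuples:

```python
# Walk from an extracted fact up the provenance DAG to the top-level
# source document. Predicate names (tg:subjectOf, tg:parent) are
# illustrative assumptions.
def trace_to_source(graph, fact):
    """Return the chain [chunk, page, ..., source document] for a fact."""
    # subjectOf points at the immediate parent (the chunk)
    node = next(o for s, p, o in graph if s == fact and p == "tg:subjectOf")
    chain = [node]
    while True:
        parents = [o for s, p, o in graph if s == node and p == "tg:parent"]
        if not parents:
            break
        node = parents[0]
        chain.append(node)
    return chain

graph = {
    ("fact:1", "tg:subjectOf", "chunk:1"),
    ("chunk:1", "tg:parent", "page:1"),
    ("page:1", "tg:parent", "doc:report"),
}
print(trace_to_source(graph, "fact:1"))  # ['chunk:1', 'page:1', 'doc:report']
```

In a real deployment this would be a graph query (e.g. a SPARQL property path) rather than an in-memory walk, but the traversal shape is the same.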
## Use Cases
### UC1: Source Attribution in GraphRAG Responses
**Scenario:** A user runs a GraphRAG query and receives a response from the agent.
**Flow:**
1. User submits a query to the GraphRAG agent
2. Agent retrieves relevant facts from the knowledge graph to formulate a response
3. Per the query-time provenance spec, the agent reports which facts contributed to the response
4. Each fact links to its source chunk via the provenance DAG
5. Chunks link to pages, pages link to source documents
**UX Outcome:** The interface displays the LLM response alongside source attribution. The user can:
- See which facts supported the response
- Drill down from facts → chunks → pages → documents
- Peruse the original source documents to verify claims
- Understand exactly where in a document (which page, which section) a fact originated
**Value:** Users can verify AI-generated responses against primary sources, building trust and enabling fact-checking.
### UC2: Debugging Extraction Quality
A fact looks wrong. Trace back through chunk → page → document to see the original text. Was it a bad extraction, or was the source itself wrong?
### UC3: Incremental Re-extraction
Source document gets updated. Which chunks/facts were derived from it? Invalidate and regenerate just those, rather than re-processing everything.
### UC4: Data Deletion / Right to be Forgotten
A source document must be removed (GDPR, legal, etc.). Traverse the DAG to find and remove all derived facts.
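UC3 and UC4 both reduce to a downward traversal: starting from a source document, collect everything derived from it. A minimal sketch, assuming the same hypothetical `tg:parent`/`tg:subjectOf` edges used elsewhere in these notes:

```python
# Collect every artifact and fact transitively derived from a source
# document by walking the DAG downward (inverse of parent/subjectOf).
def derived_from(graph, doc):
    """Return all nodes transitively derived from doc (pages, chunks, facts)."""
    derived, frontier = set(), {doc}
    while frontier:
        node = frontier.pop()
        children = {s for s, p, o in graph
                    if o == node and p in ("tg:parent", "tg:subjectOf")}
        frontier |= children - derived
        derived |= children
    return derived

graph = {
    ("page:1", "tg:parent", "doc:report"),
    ("chunk:1", "tg:parent", "page:1"),
    ("fact:1", "tg:subjectOf", "chunk:1"),
}
print(sorted(derived_from(graph, "doc:report")))  # ['chunk:1', 'fact:1', 'page:1']
```

For UC4 the result set is deleted; for UC3 it is invalidated and re-queued for extraction.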
### UC5: Conflict Resolution
Two facts contradict each other. Trace both back to their sources to understand why and decide which to trust (more authoritative source, more recent, etc.).
### UC6: Source Authority Weighting
Some sources are more authoritative than others. Facts can be weighted or filtered based on the authority/quality of their origin documents.
### UC7: Extraction Pipeline Comparison
Compare outputs from different extraction methods/versions. Which extractor produced better facts from the same source?
## Integration Points
### Librarian
The librarian component already provides document storage with unique document IDs. The provenance system integrates with this existing infrastructure.
#### Existing Capabilities (already implemented)
**Parent-Child Document Linking:**
- `parent_id` field in `DocumentMetadata` - links child to parent document
- `document_type` field - values: `"source"` (original) or `"extracted"` (derived)
- `add-child-document` API - creates child document with automatic `document_type = "extracted"`
- `list-children` API - retrieves all children of a parent document
- Cascade deletion - removing a parent automatically deletes all child documents
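A hypothetical client illustrating these semantics - the class and method names are our own; only the behaviour (`parent_id`, automatic `document_type = "extracted"`, cascade deletion) comes from the capabilities listed above:

```python
# In-memory illustration of the librarian's parent-child document APIs.
# This is a sketch, not the real librarian interface.
class Librarian:
    def __init__(self):
        self.docs = {}  # document_id -> metadata dict

    def add_document(self, document_id, **metadata):
        self.docs[document_id] = {"document_type": "source",
                                  "parent_id": None, **metadata}

    def add_child_document(self, parent_id, document_id, **metadata):
        # Children are automatically marked as extracted
        self.docs[document_id] = {"document_type": "extracted",
                                  "parent_id": parent_id, **metadata}

    def list_children(self, parent_id):
        return [d for d, m in self.docs.items() if m["parent_id"] == parent_id]

    def delete_document(self, document_id):
        # Cascade: removing a parent also removes all descendants
        for child in self.list_children(document_id):
            self.delete_document(child)
        del self.docs[document_id]

lib = Librarian()
lib.add_document("doc:report")
lib.add_child_document("doc:report", "page:1")
lib.add_child_document("page:1", "chunk:1")
lib.delete_document("doc:report")
print(lib.docs)  # {}
```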
**Document Identification:**
- Document IDs are client-specified (not auto-generated)
- Documents keyed by composite `(user, document_id)` in Cassandra
- Object IDs (UUIDs) generated internally for blob storage
**Metadata Support:**
- `metadata: list[Triple]` field - RDF triples for structured metadata
The design accommodates both source documents and pages because the chunker treats its input generically - it uses whatever document ID it receives as the parent, regardless of whether that is a source document or a page.
### Metadata Schema (PROV-O)
Provenance metadata uses the W3C PROV-O ontology. This provides a standard vocabulary and enables future signing/authentication of extraction outputs.
| Field | Description | Example |
|-------|-------------|---------|
| model | Embedding model used | `text-embedding-ada-002` |
| component_version | TG embedder version | `1.0.0` |
The `entity` field links the embedding to the knowledge graph (node URI). The `chunk_id` field provides provenance back to the source chunk, enabling traversal up the DAG to the original document.
#### TrustGraph Namespace Extensions
Custom predicates under the `tg:` namespace for extraction-specific metadata:
| Predicate | Applies to | Description |
|-----------|------------|-------------|
| `tg:chunkOverlap` | Activity | Configured overlap between chunks |
| `tg:componentVersion` | Activity | Version of TG component |
| `tg:llmModel` | Activity | LLM used for extraction |
| `tg:ontology` | Activity | Ontology URI used to guide extraction |
| `tg:embeddingModel` | Activity | Model used for embeddings |
| `tg:sourceText` | Statement | Exact text from which a triple was extracted |
| `tg:sourceCharOffset` | Statement | Character offset within chunk where source text starts |
| `tg:sourceCharLength` | Statement | Length of source text in characters |
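These predicates combine with standard PROV-O terms on extraction activities and extracted statements. A sketch, again with triples as plain tuples - the identifiers and the model name are made up for the example; `prov:wasGeneratedBy` and `prov:Activity` are genuine PROV-O terms:

```python
# Illustrative provenance triples for one extraction activity and one
# extracted statement, mixing PROV-O vocabulary with tg: extensions.
activity = "activity:extract-chunk-1"
statement = "stmt:acme-founded-1999"

triples = {
    (activity, "rdf:type", "prov:Activity"),
    (activity, "tg:llmModel", "gemma2:9b"),  # assumed model name
    (activity, "tg:componentVersion", "1.0.0"),
    (statement, "prov:wasGeneratedBy", activity),
    (statement, "tg:sourceText", "Acme was founded in 1999."),
    (statement, "tg:sourceCharOffset", 412),
    (statement, "tg:sourceCharLength", 25),
}
```

Note that `tg:sourceCharLength` matches the length of the `tg:sourceText` value - the two should be kept consistent by the extractor.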
#### Vocabulary Bootstrap (Per Collection)
The knowledge graph is ontology-neutral and initialises empty. When writing PROV-O provenance data to a collection for the first time, the vocabulary must be bootstrapped with RDF labels for all classes and predicates. This ensures human-readable display in queries and UI. For example:
```
tg:sourceCharOffset rdfs:label "source character offset" .
tg:sourceCharLength rdfs:label "source character length" .
```
**Implementation note:** This vocabulary bootstrap should be idempotent - safe to run multiple times without creating duplicates. Could be triggered on first document processing in a collection, or as a separate collection initialisation step.
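Idempotency falls out naturally if the label triples are asserted into set-semantics storage, as an RDF graph is a set of triples. A minimal sketch (the `LABELS` map mirrors the excerpt above; the helper name is our own):

```python
# Idempotent vocabulary bootstrap: re-asserting the same label triples
# into a set-of-triples graph creates no duplicates.
LABELS = {
    "tg:sourceCharOffset": "source character offset",
    "tg:sourceCharLength": "source character length",
}

def bootstrap_vocabulary(graph):
    for predicate, label in LABELS.items():
        graph.add((predicate, "rdfs:label", label))

graph = set()
bootstrap_vocabulary(graph)
bootstrap_vocabulary(graph)  # safe to run again
print(len(graph))  # 2
```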
#### Sub-Chunk Provenance (Aspirational)
For finer-grained provenance, it would be valuable to record exactly where within a chunk a triple was extracted from. This enables:
- Highlighting the exact source text in the UI
- Verifying extraction accuracy against source
- Debugging extraction quality at the sentence level
Challenges and possible approaches:
- LLM-based extraction may not naturally provide character positions
- Could prompt the LLM to return the source sentence/phrase alongside extracted triples
- Alternatively, post-process to fuzzy-match extracted entities back to source text
- Trade-off between extraction complexity and provenance granularity
- May be easier to achieve with structured extraction methods than free-form LLM extraction
This is marked as aspirational - the basic chunk-level provenance should be implemented first, with sub-chunk tracking as a future enhancement if feasible.
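The post-processing approach can be sketched as follows: locate the LLM-reported source sentence within the chunk to derive `tg:sourceCharOffset` and `tg:sourceCharLength`, falling back to a fuzzy match when the model paraphrases. The function name is our own:

```python
# Locate reported source text within a chunk; exact match first, then
# a fuzzy fallback via the longest common block. Sketch only.
from difflib import SequenceMatcher

def locate_source(chunk_text, reported_text):
    """Return (offset, length) of reported_text within chunk_text."""
    offset = chunk_text.find(reported_text)
    if offset != -1:
        return offset, len(reported_text)
    # Fuzzy fallback: longest common block between chunk and report
    match = SequenceMatcher(None, chunk_text, reported_text).find_longest_match(
        0, len(chunk_text), 0, len(reported_text))
    return match.a, match.size

chunk = "Acme was founded in 1999. It is based in Oslo."
print(locate_source(chunk, "founded in 1999"))  # (9, 15)
```

The fuzzy fallback trades precision for robustness; a production version would likely also record a match confidence so that unreliable offsets can be flagged rather than stored silently.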
### Dual Storage Model
The provenance DAG is built progressively as documents flow through the pipeline:
Both stores - the librarian and the knowledge graph - maintain the same DAG structure. The librarian holds content; the graph holds relationships and enables traversal queries.
### Key Design Principles
1. **Document ID as the unit of flow** - Processors pass IDs, not content. Content is fetched from the librarian when needed.
2. **Emit once at source** - Metadata is written to the graph once when processing begins, not repeated downstream.
3. **Consistent processor pattern** - Every processor follows the same receive/fetch/produce/save/emit/forward pattern.
4. **Progressive DAG construction** - Each processor adds its level to the DAG. The full provenance chain is built incrementally.
5. **Post-chunker optimization** - After chunking, messages carry both ID and content. Chunks are small (2-4KB), so including content avoids unnecessary librarian round-trips while preserving provenance via the ID.
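The receive/fetch/produce/save/emit/forward pattern can be sketched as a skeleton. Everything here is illustrative - the stub librarian, the fixed-size chunking, and the function names stand in for real pipeline components:

```python
# Skeleton of the consistent processor pattern: receive an ID, fetch
# content, produce derived artifacts, save them, emit provenance, and
# forward the new IDs. Sketch only; not the real processor interface.
class StubLibrarian:
    """In-memory stand-in for the librarian, for illustration only."""
    def __init__(self, docs):
        self.docs = dict(docs)

    def fetch(self, doc_id):
        return self.docs[doc_id]

    def save(self, doc_id, content, parent_id):
        self.docs[doc_id] = content

def produce(content, size=2048):
    # Placeholder transformation: split into fixed-size chunks
    return [content[i:i + size] for i in range(0, len(content), size)]

def process(document_id, librarian, graph, forward):
    content = librarian.fetch(document_id)                        # fetch
    for i, derived in enumerate(produce(content)):                # produce
        child_id = f"{document_id}/part-{i}"
        librarian.save(child_id, derived, parent_id=document_id)  # save
        graph.add((child_id, "tg:parent", document_id))           # emit provenance
        forward(child_id)                                         # forward

graph, forwarded = set(), []
lib = StubLibrarian({"doc:1": "x" * 5000})
process("doc:1", lib, graph, forwarded.append)
print(forwarded)  # ['doc:1/part-0', 'doc:1/part-1', 'doc:1/part-2']
```

Each downstream processor applies the same pattern to the IDs it receives, which is what makes the DAG construction progressive.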
## Implementation Tasks
### Librarian Changes
#### Current State
- Initiates document processing by sending document ID to first processor
- No connection to triple store - metadata is bundled with extraction outputs