trustgraph/docs/tech-specs/extraction-time-provenance.md
cybermaggedon 4d31cd4c03
Agent explainability tech specs (#655)
* Query time provenance tech spec

* Extraction provenance placeholder
2026-02-28 14:44:18 +00:00


# Extraction-Time Provenance: Source Layer
## Status
Notes - Not yet started
## Overview
This document captures notes on extraction-time provenance for future specification work. Extraction-time provenance records the "source layer": where data originally came from and how it was extracted and transformed.
This is separate from query-time provenance (see `query-time-provenance.md`), which records agent reasoning.
## Current State
Source metadata is already partially stored in the knowledge graph (~40% solved):
- Documents have source URLs and timestamps
- Some extraction metadata exists
## Scope
Extraction-time provenance should capture:
### Source Layer (Origin)
- URL / file path
- Retrieval timestamp
- Funding sources
- Authorship / authority
- Document metadata (title, date, version)
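
As a rough illustration of the origin fields listed above, the record could be sketched as a small immutable data class. This is only a sketch: the class name `SourceRecord`, every field name, and the example URL are hypothetical and do not reflect an existing TrustGraph schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical origin record; names are illustrative only."""
    location: str                          # URL or file path
    retrieved_at: datetime                 # retrieval timestamp, timezone-aware
    title: Optional[str] = None            # document metadata
    document_date: Optional[str] = None
    version: Optional[str] = None
    authors: tuple[str, ...] = ()          # authorship / authority
    funding_sources: tuple[str, ...] = ()  # funding sources, if known


# Example values are made up for illustration.
record = SourceRecord(
    location="https://example.com/report.pdf",
    retrieved_at=datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc),
    title="Example Report",
    version="1.0",
    authors=("A. Author",),
)

# A plain dict form is convenient for serialisation or later mapping to triples.
record_dict = asdict(record)
```

Keeping the optional fields defaulted to `None` or an empty tuple lets a record be created even when only the location and retrieval time are known, which matches the partial coverage described under Current State.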
### Transformation Layer (Extraction)
- Extraction tool used (e.g., PDF parser, table extractor)
- Extraction method / version
- Confidence scores
- Raw-to-structured mapping
- Parent-child relationships (PDF → table → row → fact)
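
The parent-child chain above (PDF → table → row → fact) could be modelled, as one possible sketch, with nodes that each point at their parent and carry the extraction details. The names `ExtractionNode` and `lineage`, the identifiers, the tool names, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionNode:
    """Hypothetical node in an extraction chain; names are illustrative only."""
    node_id: str
    kind: str                                   # e.g. "pdf", "table", "row", "fact"
    parent: Optional["ExtractionNode"] = None   # None for the original source
    tool: Optional[str] = None                  # extraction tool used
    tool_version: Optional[str] = None          # extraction method / version
    confidence: Optional[float] = None          # confidence score, if available


def lineage(node: Optional[ExtractionNode]) -> list[ExtractionNode]:
    """Walk parent links and return the chain ordered from source to node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()
    return chain


# Made-up example chain: PDF -> table -> row -> fact.
pdf = ExtractionNode("doc-1", "pdf")
table = ExtractionNode("doc-1/table-2", "table", parent=pdf,
                       tool="table-extractor", tool_version="0.1", confidence=0.9)
row = ExtractionNode("doc-1/table-2/row-5", "row", parent=table)
fact = ExtractionNode("doc-1/table-2/row-5/fact-1", "fact", parent=row,
                      tool="fact-extractor", tool_version="0.1", confidence=0.8)

path = " -> ".join(n.kind for n in lineage(fact))
```

A single parent pointer gives a tree rather than a general DAG; a fact derived from several inputs would need a list of parents instead.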
## Key Questions for Future Spec
1. What metadata is already captured today?
2. What gaps exist?
3. How should the extraction DAG be structured?
4. How does query-time provenance link to extraction-time nodes?
5. What storage format should be used: RDF triples or a separate schema?
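
To make the RDF option more concrete, one extraction step could be written as PROV-O triples along these lines. This is a sketch under assumptions, not a decided format: the `EX` namespace, the `derivation_triples` helper, and the identifiers are hypothetical, and triples are shown as plain Python tuples rather than through any particular RDF library. The `prov:` terms used (`Entity`, `Activity`, `used`, `wasGeneratedBy`, `wasDerivedFrom`) are from the PROV-O vocabulary.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.com/prov/"  # hypothetical namespace for illustration


def derivation_triples(derived: str, source: str, activity: str) -> list[tuple[str, str, str]]:
    """Describe one extraction step as (subject, predicate, object) triples."""
    return [
        (EX + derived, RDF_TYPE, PROV + "Entity"),
        (EX + source, RDF_TYPE, PROV + "Entity"),
        (EX + activity, RDF_TYPE, PROV + "Activity"),
        # The extraction activity consumed the source...
        (EX + activity, PROV + "used", EX + source),
        # ...and produced the derived item.
        (EX + derived, PROV + "wasGeneratedBy", EX + activity),
        # Direct shortcut link between derived item and source.
        (EX + derived, PROV + "wasDerivedFrom", EX + source),
    ]


# Made-up identifiers: a table extracted from a PDF.
triples = derivation_triples("table-2", "doc-1", "extract-run-1")
```

Chaining such steps (document to table, table to row, row to fact) yields a graph in which a query-time node could reference a fact's identifier and follow `prov:wasDerivedFrom` back to the source.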
## References
- Query-time provenance: `docs/tech-specs/query-time-provenance.md`
- PROV-O standard for provenance modeling
- Existing source metadata in knowledge graph (needs audit)