trustgraph/docs/tech-specs/extraction-time-provenance.md
cybermaggedon 4d31cd4c03
2026-02-28 14:44:18 +00:00


Extraction-Time Provenance: Source Layer

Status

Notes - Not yet started

Overview

This document captures notes on extraction-time provenance for future specification work. Extraction-time provenance records the "source layer": where data originally came from and how it was extracted and transformed.

This is separate from query-time provenance (see query-time-provenance.md), which records agent reasoning.

Current State

Source metadata is already partially stored in the knowledge graph (~40% solved):

  • Documents have source URLs and timestamps
  • Some extraction metadata exists

Scope

Extraction-time provenance should capture:

Source Layer (Origin)

  • URL / file path
  • Retrieval timestamp
  • Funding sources
  • Authorship / authority
  • Document metadata (title, date, version)
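
As a starting point for discussion, the origin fields listed above could be gathered into a single record. This is a minimal sketch, not an agreed schema: the class name `SourceRecord`, its field names, and the `missing_fields` helper are all hypothetical and do not come from the existing codebase.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical origin record for one ingested document."""

    location: str                      # URL or file path
    retrieved_at: datetime             # retrieval timestamp
    title: Optional[str] = None        # document metadata
    document_date: Optional[str] = None
    version: Optional[str] = None
    authors: tuple[str, ...] = ()      # authorship / authority
    funding_sources: tuple[str, ...] = ()

    def missing_fields(self) -> list[str]:
        """Return the optional fields that were not captured."""
        optional = ("title", "document_date", "version",
                    "authors", "funding_sources")
        return [name for name in optional if not getattr(self, name)]


# Example values are illustrative only.
record = SourceRecord(
    location="https://example.com/report.pdf",
    retrieved_at=datetime(2026, 2, 28, tzinfo=timezone.utc),
    title="Example report",
)
print(record.missing_fields())
# ['document_date', 'version', 'authors', 'funding_sources']
```

A helper like `missing_fields` may also be useful when auditing which source metadata is already captured today.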

Transformation Layer (Extraction)

  • Extraction tool used (e.g., PDF parser, table extractor)
  • Extraction method / version
  • Confidence scores
  • Raw-to-structured mapping
  • Parent-child relationships (PDF → table → row → fact)
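
The parent-child chain above (PDF → table → row → fact) can be pictured as nodes that each point back to what they were extracted from. The sketch below is one possible shape under stated assumptions: `ProvNode` and its fields are hypothetical names, and each node has a single parent for brevity, whereas a full DAG would allow several.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvNode:
    """Hypothetical node in the extraction graph."""

    node_id: str
    kind: str                            # e.g. "pdf", "table", "row", "fact"
    parent: Optional[ProvNode] = None    # node this one was extracted from
    tool: Optional[str] = None           # extraction tool that produced it
    tool_version: Optional[str] = None
    confidence: Optional[float] = None

    def lineage(self) -> list[ProvNode]:
        """Walk parent links from this node back to the original source."""
        chain = []
        node: Optional[ProvNode] = self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain


# Identifiers, tool names, and scores are made up for illustration.
pdf = ProvNode("doc-1", "pdf")
table = ProvNode("doc-1/table-3", "table", parent=pdf,
                 tool="table-extractor", tool_version="0.1", confidence=0.92)
row = ProvNode("doc-1/table-3/row-7", "row", parent=table)
fact = ProvNode("fact-42", "fact", parent=row, confidence=0.8)

print(" <- ".join(n.kind for n in fact.lineage()))
# fact <- row <- table <- pdf
```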

Key Questions for Future Spec

  1. What metadata is already captured today?
  2. What gaps exist?
  3. How should the extraction DAG be structured?
  4. How does query-time provenance link to extraction-time nodes?
  5. What storage format should be used: RDF triples or a separate schema?
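
To make the storage question more concrete, here is a rough sketch of what the RDF option might look like using PROV-O terms. It does not settle the question. The `EX` namespace, the function name, and all identifiers are hypothetical; only the `prov:` and `rdf:type` terms come from the published vocabularies. Triples are plain tuples of IRI strings, so no RDF library is assumed.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.com/prov/"  # hypothetical namespace, for illustration


def extraction_triples(fact_id, source_id, activity_id, tool_id):
    """Return PROV-O style triples linking an extracted fact to its source."""
    fact, source = EX + fact_id, EX + source_id
    activity, tool = EX + activity_id, EX + tool_id
    return [
        (fact, RDF_TYPE, PROV + "Entity"),
        (source, RDF_TYPE, PROV + "Entity"),
        (activity, RDF_TYPE, PROV + "Activity"),
        (tool, RDF_TYPE, PROV + "SoftwareAgent"),
        (fact, PROV + "wasDerivedFrom", source),       # fact came from source
        (fact, PROV + "wasGeneratedBy", activity),     # produced by extraction
        (activity, PROV + "used", source),             # extraction read source
        (activity, PROV + "wasAssociatedWith", tool),  # tool ran extraction
    ]


# Print in N-Triples form (every term here is an IRI).
for s, p, o in extraction_triples("fact-42", "doc-1", "extract-1", "pdf-parser"):
    print(f"<{s}> <{p}> <{o}> .")
```

Under this option, one way a query-time node could link to the extraction layer is by referencing the fact entity's IRI. Whether that link belongs in the same graph or in a separate schema remains open.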

References

  • Query-time provenance: docs/tech-specs/query-time-provenance.md
  • PROV-O standard for provenance modeling
  • Existing source metadata in knowledge graph (needs audit)