mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-05-25 23:35:12 +02:00
Update tech spec
parent 7a6197d8c3
commit 34968c801b
2 changed files with 263 additions and 282 deletions
263
docs/tech-specs/query-time-explainability.md
Normal file
@ -0,0 +1,263 @@
# Query-Time Explainability

## Status

Implemented

## Overview

This specification describes how GraphRAG records and communicates explainability data during query execution. The goal is full traceability: from the final answer back through the selected edges to the source documents.

Query-time explainability captures what the GraphRAG pipeline did during reasoning. It connects to extraction-time provenance, which records where knowledge graph facts originated.

## Terminology

| Term | Definition |
|------|------------|
| **Explainability** | The record of how a result was derived |
| **Session** | A single GraphRAG query execution |
| **Edge Selection** | LLM-driven selection of relevant edges with reasoning |
| **Provenance Chain** | Path from edge → chunk → page → document |

## Architecture

### Explainability Flow

```
GraphRAG Query
    │
    ├─► Session Activity
    │       └─► Query text, timestamp
    │
    ├─► Retrieval Entity
    │       └─► All edges retrieved from subgraph
    │
    ├─► Selection Entity
    │       └─► Selected edges with LLM reasoning
    │           └─► Each edge links to extraction provenance
    │
    └─► Answer Entity
            └─► Reference to synthesized response (in librarian)
```

### Two-Stage GraphRAG Pipeline

1. **Edge Selection**: The LLM selects relevant edges from the subgraph and provides reasoning for each
2. **Synthesis**: The LLM generates the answer from the selected edges only

This separation enables explainability: because synthesis sees only the selected edges, we know exactly which edges contributed to the answer.

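The two stages can be sketched as a small function that takes the two LLM calls as parameters. This is a minimal illustration, not the code in `graph_rag.py`: the names `two_stage_answer`, `fake_select` and `fake_synthesize` are hypothetical, and the stand-in functions replace real LLM calls.

```python
from typing import Callable

Edge = tuple[str, str, str]
Selection = tuple[Edge, str]  # (edge, reasoning)


def two_stage_answer(
    question: str,
    subgraph: list[Edge],
    select: Callable[[str, list[Edge]], list[Selection]],
    synthesize: Callable[[str, list[Edge]], str],
) -> tuple[str, list[Selection]]:
    """Run edge selection, then synthesis over the selected edges only."""
    selected = select(question, subgraph)        # stage 1: edges plus reasoning
    edges_only = [edge for edge, _ in selected]
    answer = synthesize(question, edges_only)    # stage 2: sees nothing else
    return answer, selected


# Hypothetical stand-ins for the two LLM calls.
def fake_select(question: str, edges: list[Edge]) -> list[Selection]:
    return [(e, "mentions the query subject") for e in edges if e[0] == "Guantanamo"]


def fake_synthesize(question: str, edges: list[Edge]) -> str:
    return f"Answer built from {len(edges)} edge(s)"


subgraph = [
    ("Guantanamo", "definition", "A detention facility"),
    ("Paris", "capital-of", "France"),
]
answer, selected = two_stage_answer(
    "What was the War on Terror?", subgraph, fake_select, fake_synthesize
)
```

Returning the selections alongside the answer is what lets the caller record each contributing edge with its reasoning.
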
### Storage

- Explainability triples are stored in a configurable collection (default: `explainability`)
- The PROV-O ontology is used for provenance relationships
- RDF-star reification is used for edge references
- Answer content is stored in the librarian service (not inline, because it is too large)

### Real-Time Streaming

Explainability events stream to the client as the query executes:

1. Session created → event emitted
2. Edges retrieved → event emitted
3. Edges selected with reasoning → event emitted
4. Answer synthesized → event emitted

The client receives `explain_id` and `explain_collection`, which it can use to fetch full details.

## URI Structure

All URIs use the `urn:trustgraph:` namespace with UUIDs:

| Entity | URI Pattern |
|--------|-------------|
| Session | `urn:trustgraph:session:{uuid}` |
| Retrieval | `urn:trustgraph:prov:retrieval:{uuid}` |
| Selection | `urn:trustgraph:prov:selection:{uuid}` |
| Answer | `urn:trustgraph:prov:answer:{uuid}` |
| Edge Selection | `urn:trustgraph:prov:edge:{uuid}:{index}` |

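The patterns in the table translate directly into string-formatting helpers. The sketch below is illustrative only: the function names are hypothetical and are not taken from `uris.py`, and reusing one UUID across all entities of a session is an assumption suggested by the CLI output example, not something this spec states.

```python
import uuid


def session_uri(session_id: str) -> str:
    return f"urn:trustgraph:session:{session_id}"


def retrieval_uri(session_id: str) -> str:
    return f"urn:trustgraph:prov:retrieval:{session_id}"


def selection_uri(session_id: str) -> str:
    return f"urn:trustgraph:prov:selection:{session_id}"


def answer_uri(session_id: str) -> str:
    return f"urn:trustgraph:prov:answer:{session_id}"


def edge_selection_uri(session_id: str, index: int) -> str:
    return f"urn:trustgraph:prov:edge:{session_id}:{index}"


# Assumption: one UUID is generated per query and shared by all entities.
sid = str(uuid.uuid4())
uris = {
    "session": session_uri(sid),
    "retrieval": retrieval_uri(sid),
    "selection": selection_uri(sid),
    "answer": answer_uri(sid),
    "first_edge": edge_selection_uri(sid, 0),
}
```
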
## RDF Model (PROV-O)

### Session Activity

```turtle
<session-uri> a prov:Activity ;
    rdfs:label "GraphRAG query session" ;
    prov:startedAtTime "2024-01-15T10:30:00Z" ;
    tg:query "What was the War on Terror?" .
```

### Retrieval Entity

```turtle
<retrieval-uri> a prov:Entity ;
    rdfs:label "Retrieved edges" ;
    prov:wasGeneratedBy <session-uri> ;
    tg:edgeCount 50 .
```

### Selection Entity

```turtle
<selection-uri> a prov:Entity ;
    rdfs:label "Selected edges" ;
    prov:wasDerivedFrom <retrieval-uri> ;
    tg:selectedEdge <edge-sel-0> ;
    tg:selectedEdge <edge-sel-1> .

<edge-sel-0> tg:edge << <s> <p> <o> >> ;
    tg:reasoning "This edge establishes the key relationship..." .
```

### Answer Entity

```turtle
<answer-uri> a prov:Entity ;
    rdfs:label "GraphRAG answer" ;
    prov:wasDerivedFrom <selection-uri> ;
    tg:document <urn:trustgraph:answer:{uuid}> .
```

The `tg:document` references the answer stored in the librarian service.

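As a rough illustration of how the session and retrieval descriptions above map onto individual triples, the following sketch emits them as plain Python tuples. It is not the code in `triples.py`: the function names are hypothetical, and plain tuples do not distinguish IRIs from literals the way a real RDF term type would. The PROV, RDF and RDFS namespace IRIs are the standard W3C ones, and the `tg:` predicates use the IRIs listed under Namespace Constants.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
PROV = "http://www.w3.org/ns/prov#"
TG = "https://trustgraph.ai/ns/"

Triple = tuple[str, str, object]


def session_triples(session_uri: str, query: str, started_at: str) -> list[Triple]:
    """Triples for the session activity (hypothetical helper)."""
    return [
        (session_uri, RDF_TYPE, PROV + "Activity"),
        (session_uri, RDFS_LABEL, "GraphRAG query session"),
        (session_uri, PROV + "startedAtTime", started_at),
        (session_uri, TG + "query", query),
    ]


def retrieval_triples(retrieval_uri: str, session_uri: str, edge_count: int) -> list[Triple]:
    """Triples for the retrieval entity, linked back to its session."""
    return [
        (retrieval_uri, RDF_TYPE, PROV + "Entity"),
        (retrieval_uri, RDFS_LABEL, "Retrieved edges"),
        (retrieval_uri, PROV + "wasGeneratedBy", session_uri),
        (retrieval_uri, TG + "edgeCount", edge_count),
    ]


triples = session_triples(
    "urn:trustgraph:session:abc123",
    "What was the War on Terror?",
    "2024-01-15T10:30:00Z",
) + retrieval_triples(
    "urn:trustgraph:prov:retrieval:abc123",
    "urn:trustgraph:session:abc123",
    50,
)
```
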
## Namespace Constants

Defined in `trustgraph-base/trustgraph/provenance/namespaces.py`:

| Constant | URI |
|----------|-----|
| `TG_QUERY` | `https://trustgraph.ai/ns/query` |
| `TG_EDGE_COUNT` | `https://trustgraph.ai/ns/edgeCount` |
| `TG_SELECTED_EDGE` | `https://trustgraph.ai/ns/selectedEdge` |
| `TG_EDGE` | `https://trustgraph.ai/ns/edge` |
| `TG_REASONING` | `https://trustgraph.ai/ns/reasoning` |
| `TG_CONTENT` | `https://trustgraph.ai/ns/content` |
| `TG_DOCUMENT` | `https://trustgraph.ai/ns/document` |

## GraphRagResponse Schema

```python
from dataclasses import dataclass

# `Error` is a schema type defined elsewhere (not shown in this excerpt).

@dataclass
class GraphRagResponse:
    error: Error | None = None
    response: str = ""
    end_of_stream: bool = False
    explain_id: str | None = None
    explain_collection: str | None = None
    message_type: str = ""  # "chunk" or "explain"
    end_of_session: bool = False
```

### Message Types

| message_type | Purpose |
|--------------|---------|
| `chunk` | Response text (streaming or final) |
| `explain` | Explainability event with IRI reference |

### Session Lifecycle

1. Multiple `explain` messages (session, retrieval, selection, answer)
2. Multiple `chunk` messages (streaming response)
3. Final `chunk` with `end_of_session=True`

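A client can handle this lifecycle with a single loop that dispatches on `message_type` and stops at `end_of_session`. The sketch below is an assumption-laden illustration: `Message` is a local stand-in carrying only the `GraphRagResponse` fields used here, and `consume` is a hypothetical helper, not part of the TrustGraph client API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Message:
    """Local stand-in for the GraphRagResponse fields used below."""
    message_type: str = ""
    response: str = ""
    explain_id: Optional[str] = None
    explain_collection: Optional[str] = None
    end_of_session: bool = False


def consume(messages: Iterable[Message]) -> tuple[list[tuple[Optional[str], Optional[str]]], str]:
    """Collect (collection, explain_id) references and the answer text."""
    explain_refs = []
    text_parts = []
    for msg in messages:
        if msg.message_type == "explain":
            explain_refs.append((msg.explain_collection, msg.explain_id))
        elif msg.message_type == "chunk":
            text_parts.append(msg.response)
        if msg.end_of_session:
            break  # nothing after the final chunk belongs to this session
    return explain_refs, "".join(text_parts)


stream = [
    Message("explain", explain_id="urn:trustgraph:session:abc123",
            explain_collection="explainability"),
    Message("explain", explain_id="urn:trustgraph:prov:retrieval:abc123",
            explain_collection="explainability"),
    Message("chunk", response="Based on the provided "),
    Message("chunk", response="knowledge statements...", end_of_session=True),
]
refs, text = consume(stream)
```

The collected references can then be looked up in the named collection to retrieve the full explainability triples.
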
## Edge Selection Format

The LLM returns JSONL, one selected edge per line:

```jsonl
{"id": "edge-hash-1", "reasoning": "This edge shows the key relationship..."}
{"id": "edge-hash-2", "reasoning": "Provides supporting evidence..."}
```

The `id` is a hash of `(labeled_s, labeled_p, labeled_o)` computed by `edge_id()`.

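Parsing this output amounts to decoding each non-empty line as JSON. The following is a minimal sketch, not the parser used in `graph_rag.py`: the function name `parse_edge_selection` is hypothetical, and silently skipping malformed lines is an assumed policy chosen for the example.

```python
import json


def parse_edge_selection(llm_output: str) -> list[tuple[str, str]]:
    """Return (edge id, reasoning) pairs from JSONL text."""
    selections = []
    for line in llm_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumed policy: ignore lines that are not valid JSON
        if isinstance(record, dict) and "id" in record:
            selections.append((record["id"], record.get("reasoning", "")))
    return selections


llm_output = """
{"id": "edge-hash-1", "reasoning": "This edge shows the key relationship..."}
not json at all
{"id": "edge-hash-2", "reasoning": "Provides supporting evidence..."}
"""
selections = parse_edge_selection(llm_output)
```
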
## URI Preservation

### The Problem

GraphRAG displays human-readable labels to the LLM, but explainability needs the original URIs for provenance tracing.

### Solution

`get_labelgraph()` returns both:
- `labeled_edges`: List of `(label_s, label_p, label_o)` for the LLM
- `uri_map`: Dict mapping `edge_id(labels)` → `(uri_s, uri_p, uri_o)`

When storing explainability data, the URIs from `uri_map` are used.

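The idea can be sketched as follows. This is an illustration under stated assumptions rather than the real `get_labelgraph()`: `build_labelgraph` is a hypothetical name, the label lookup is a plain dict, and the hash inside `edge_id` (a truncated SHA-256 of the JSON-encoded labels) is a guess, since this spec does not say which hash the real `edge_id()` uses.

```python
import hashlib
import json

Edge = tuple[str, str, str]


def edge_id(s: str, p: str, o: str) -> str:
    """Stable ID for a labeled edge (hash choice is an assumption)."""
    payload = json.dumps([s, p, o], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def build_labelgraph(
    uri_edges: list[Edge], labels: dict[str, str]
) -> tuple[list[Edge], dict[str, Edge]]:
    """Return labeled edges for the LLM plus a map back to the URI edges."""
    labeled_edges = []
    uri_map = {}
    for s, p, o in uri_edges:
        # Fall back to the raw term when no label is known.
        labeled = (labels.get(s, s), labels.get(p, p), labels.get(o, o))
        labeled_edges.append(labeled)
        uri_map[edge_id(*labeled)] = (s, p, o)
    return labeled_edges, uri_map


uri_edges = [
    ("http://example.org/gitmo", "http://example.org/definition", "A detention facility"),
]
labels = {
    "http://example.org/gitmo": "Guantanamo",
    "http://example.org/definition": "definition",
}
labeled_edges, uri_map = build_labelgraph(uri_edges, labels)

# The LLM returns the ID of a labeled edge; the map recovers the URIs.
selected_id = edge_id(*labeled_edges[0])
original = uri_map[selected_id]
```
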
## Provenance Tracing

### From Edge to Source

Selected edges can be traced back to source documents:

1. Query for reifying statement: `?stmt tg:reifies <<s p o>>`
2. Follow `prov:wasDerivedFrom` chain to root document
3. Each step in chain: chunk → page → document

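Step 2 is a simple walk along `prov:wasDerivedFrom` links. The sketch below runs over an in-memory list of triples instead of the triple store, and assumes each node has at most one parent; `trace_to_root` and the example URIs are hypothetical. The predicate IRI is the standard PROV-O one.

```python
PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def trace_to_root(start: str, triples: list[tuple[str, str, str]], max_hops: int = 10) -> list[str]:
    """Follow wasDerivedFrom links from `start` until no parent remains."""
    parent_of = {s: o for s, p, o in triples if p == PROV_WAS_DERIVED_FROM}
    chain = [start]
    for _ in range(max_hops):
        parent = parent_of.get(chain[-1])
        if parent is None or parent in chain:  # root reached, or a cycle
            break
        chain.append(parent)
    return chain


# Hypothetical URIs: the reifying statement found in step 1, then its sources.
triples = [
    ("urn:example:stmt-1", PROV_WAS_DERIVED_FROM, "urn:example:chunk-1"),
    ("urn:example:chunk-1", PROV_WAS_DERIVED_FROM, "urn:example:page-2"),
    ("urn:example:page-2", PROV_WAS_DERIVED_FROM, "urn:example:doc"),
    ("urn:example:doc", "http://www.w3.org/2000/01/rdf-schema#label", "Beyond the Vigilant State"),
]
chain = trace_to_root("urn:example:stmt-1", triples)
```

The `max_hops` bound and the cycle check keep the walk finite even if the stored data is malformed.
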
### Cassandra Quoted Triple Support

The Cassandra query service supports matching quoted triples:

```python
# In get_term_value():
elif term.type == TRIPLE:
    return serialize_triple(term.triple)
```

This enables queries like:
```
?stmt tg:reifies <<http://example.org/s http://example.org/p "value">>
```

## CLI Usage

```bash
tg-invoke-graph-rag --explainable -q "What was the War on Terror?"
```

### Output Format

```
[session] urn:trustgraph:session:abc123

[retrieval] urn:trustgraph:prov:retrieval:abc123

[selection] urn:trustgraph:prov:selection:abc123
  Selected 12 edge(s)
    Edge: (Guantanamo, definition, A detention facility...)
      Reason: Directly connects Guantanamo to the War on Terror
      Source: Chunk 1 → Page 2 → Beyond the Vigilant State

[answer] urn:trustgraph:prov:answer:abc123

Based on the provided knowledge statements...
```

### Features

- Real-time explainability events during query
- Label resolution for edge components via `rdfs:label`
- Source chain tracing via `prov:wasDerivedFrom`
- Label caching to avoid repeated queries

## Files Implemented

| File | Purpose |
|------|---------|
| `trustgraph-base/trustgraph/provenance/uris.py` | URI generators |
| `trustgraph-base/trustgraph/provenance/namespaces.py` | RDF namespace constants |
| `trustgraph-base/trustgraph/provenance/triples.py` | Triple builders |
| `trustgraph-base/trustgraph/schema/services/retrieval.py` | GraphRagResponse schema |
| `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py` | Core GraphRAG with URI preservation |
| `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` | Service with librarian integration |
| `trustgraph-flow/trustgraph/query/triples/cassandra/service.py` | Quoted triple query support |
| `trustgraph-cli/trustgraph/cli/invoke_graph_rag.py` | CLI with explainability display |

## References

- PROV-O (W3C Provenance Ontology): https://www.w3.org/TR/prov-o/
- RDF-star: https://w3c.github.io/rdf-star/
- Extraction-time provenance: `docs/tech-specs/extraction-time-provenance.md`

@ -1,282 +0,0 @@
# Query-Time Provenance: Agent Explainability

## Status

Draft - Gathering Requirements

## Overview

This specification defines how the agent framework records and communicates provenance during query execution. The goal is full explainability: tracing how a result was obtained, from final answer back through reasoning steps to source data.

Query-time provenance captures the "inference layer" - what the agent did during reasoning. It connects to extraction-time provenance (source layer) which records where facts came from originally.

## Terminology

| Term | Definition |
|------|------------|
| **Provenance** | The record of how a result was derived |
| **Provenance Node** | A single step or artifact in the provenance DAG |
| **Provenance DAG** | Directed Acyclic Graph of provenance relationships |
| **Query-time Provenance** | Provenance generated during agent reasoning |
| **Extraction-time Provenance** | Provenance from data ingestion (source metadata) - separate spec |

## Architecture

### Two Provenance Contexts

1. **Extraction-time** (out of scope for this spec):
   - Generated when data is ingested (PDF extraction, web scraping, etc.)
   - Records: source URL, extraction method, timestamps, funding, authorship
   - Already partially implemented via source metadata in knowledge graph
   - See: `docs/tech-specs/extraction-time-provenance.md` (notes)

2. **Query-time** (this spec):
   - Generated during agent reasoning
   - Records: tool invocations, retrieval results, LLM reasoning, final conclusions
   - Links to extraction-time provenance for retrieved facts

### Provenance Flow

```
Agent Session
    │
    ├─► Tool: Knowledge Query
    │     │
    │     ├─► Retrieved Fact A ──► [link to extraction provenance]
    │     └─► Retrieved Fact B ──► [link to extraction provenance]
    │
    ├─► LLM Reasoning Step
    │     │
    │     └─► "Combined A and B to conclude X"
    │
    └─► Final Answer
          │
          └─► Derived from reasoning step above
```

### Storage

- Provenance stored in knowledge graph infrastructure
- Segregated in a **separate collection** for distinct retrieval patterns
- Query-time provenance references extraction-time provenance nodes via IRIs
- Persists beyond agent session (reusable, auditable)

### Real-Time Streaming

Provenance events stream back to the client as the agent works:

1. Agent invokes tool
2. Tool generates provenance data
3. Provenance stored in graph
4. Provenance event sent to client
5. UX builds provenance visualization incrementally

## Provenance Node Structure

Each provenance node represents a step in the reasoning process.

### Node Identity

Provenance nodes are identified by IRIs containing UUIDs, consistent with the RDF-style knowledge graph:

```
urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440000
```

### Core Fields

| Field | Description |
|-------|-------------|
| `id` | IRI with UUID (e.g., `urn:trustgraph:prov:{uuid}`) |
| `session_id` | Agent session this belongs to |
| `timestamp` | When this step occurred |
| `type` | Node type (see below) |
| `derived_from` | List of parent node IRIs (DAG edges) |

### Node Types

| Type | Description | Additional Fields |
|------|-------------|-------------------|
| `retrieval` | Facts retrieved from knowledge graph | `facts`, `source_refs` |
| `tool_invocation` | Tool was called | `tool_name`, `input`, `output` |
| `reasoning` | LLM reasoning step | `prompt_summary`, `conclusion` |
| `answer` | Final answer produced | `content` |

### Example Provenance Nodes

```json
{
  "id": "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440001",
  "session_id": "urn:trustgraph:session:7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "timestamp": "2024-01-15T10:30:00Z",
  "type": "retrieval",
  "derived_from": [],
  "facts": [
    {
      "id": "urn:trustgraph:fact:9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
      "content": "Swallow airspeed is 8.5 m/s"
    }
  ],
  "source_refs": ["urn:trustgraph:extract:1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed"]
}
```

```json
{
  "id": "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440002",
  "session_id": "urn:trustgraph:session:7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "timestamp": "2024-01-15T10:30:01Z",
  "type": "reasoning",
  "derived_from": ["urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440001"],
  "prompt_summary": "Asked to determine average swallow speed",
  "conclusion": "Based on retrieved data, average speed is 8.5 m/s"
}
```

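Because each node lists its parents in `derived_from`, explaining a result is a graph walk from that node back to its sources. The sketch below is illustrative only: `ancestors` is a hypothetical helper, nodes are held in a plain dict keyed by shortened IDs, and the `answer` node is invented for the example.

```python
from collections import deque


def ancestors(node_id: str, nodes: dict[str, dict]) -> list[str]:
    """Return every node reachable via derived_from, nearest first."""
    seen: list[str] = []
    queue = deque(nodes[node_id].get("derived_from", []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a DAG node can be reached along several paths
        seen.append(current)
        queue.extend(nodes.get(current, {}).get("derived_from", []))
    return seen


# Shortened IDs for readability; real IDs are urn:trustgraph:prov:{uuid} IRIs.
nodes = {
    "prov:0001": {"type": "retrieval", "derived_from": []},
    "prov:0002": {"type": "reasoning", "derived_from": ["prov:0001"]},
    # Hypothetical answer node that cites both the reasoning and the retrieval.
    "prov:0003": {"type": "answer", "derived_from": ["prov:0002", "prov:0001"]},
}
lineage = ancestors("prov:0003", nodes)
```
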
## Provenance Events

Events streamed to the client during agent execution.

### Design: Lightweight Reference Events

Provenance events are lightweight - they reference provenance nodes by IRI rather than embedding full provenance data. This keeps the stream efficient while allowing the client to fetch full details if needed.

A single agent step may create or modify multiple provenance objects. The event references all of them.

### Event Structure

```json
{
  "provenance_refs": [
    "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440001",
    "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440002"
  ]
}
```

### Integration with Agent Response

Provenance events extend `AgentResponse` with a new `chunk_type: "provenance"`:

```json
{
  "chunk_type": "provenance",
  "content": "",
  "provenance_refs": ["urn:trustgraph:prov:..."],
  "end_of_message": false
}
```

This allows provenance updates to flow alongside existing chunk types (`thought`, `observation`, `answer`, `error`).

## Tool Provenance Reporting

Tools report provenance as part of their execution.

### Minimum Reporting (all tools)

Every tool can report at minimum:
- Tool name
- Input arguments
- Output result

### Enhanced Reporting (tools that can describe more)

Tools that understand their internals can report:
- What sources were consulted
- What reasoning/transformation was applied
- Confidence scores
- Links to extraction-time provenance

### Graceful Degradation

Tools that can't provide detailed provenance still participate:
```json
{
  "type": "tool_invocation",
  "tool_name": "calculator",
  "input": {"expression": "8 + 5"},
  "output": "13",
  "detail_level": "basic"
}
```

## Design Decisions

### Provenance Node Identity: IRIs with UUIDs

Provenance nodes use IRIs containing UUIDs, consistent with the RDF-style knowledge graph:
- Format: `urn:trustgraph:prov:{uuid}`
- Globally unique, persistent across sessions
- Can be dereferenced to retrieve full node data

### Storage Segregation: Separate Collection

Provenance is stored in a separate collection within the knowledge graph infrastructure. This allows:
- Distinct retrieval patterns for provenance vs. data
- Independent scaling/retention policies
- Clear separation of concerns

### Client Protocol: Extended AgentResponse

Provenance events extend `AgentResponse` with `chunk_type: "provenance"`. Events are lightweight, containing only IRI references to provenance nodes created/modified in the step.

### Retrieval Granularity: Flexible, Multiple Objects Per Step

A single agent step can create multiple provenance objects. The provenance event references all objects created or modified. This handles cases like:
- Retrieval returning multiple facts (each gets a provenance node)
- Tool invocation creating both an invocation node and result nodes

### Graph Structure: True DAG

The provenance structure is a DAG (not a tree):
- A provenance node can have multiple parents (e.g., reasoning combines facts A and B)
- Extraction-time nodes can be referenced by multiple query-time sessions
- Enables proper modeling of how conclusions derive from multiple sources

### Linking to Extraction Provenance: Direct IRI Reference

Query-time provenance references extraction-time provenance via direct IRI links in the `source_refs` field. No separate linking mechanism needed.

## Open Questions

### Provenance Retrieval API

Base layer uses the existing knowledge graph API to query the provenance collection. A higher-level service may be added to provide convenience methods. Details TBD during implementation.

### Provenance Node Granularity

Placeholder to explore: What level of detail should different node types capture?
- Should `reasoning` nodes include the full LLM prompt, or just a summary?
- How much of tool input/output to store?
- Trade-offs between completeness and storage/performance

### Provenance Retention

TBD - retention policy to be determined:
- Indefinitely?
- Tied to session retention?
- Configurable per collection?

## Implementation Considerations

### Files Likely Affected

| Area | Changes |
|------|---------|
| Agent service | Generate provenance events |
| Tool implementations | Report provenance data |
| Agent response schema | Add provenance event type |
| Knowledge graph | Provenance storage/retrieval |

### Backward Compatibility

- Existing agent clients continue to work (provenance is additive)
- Tools that don't report provenance still function

## References

- PROV-O (PROV-Ontology): W3C standard for provenance modeling
- Current agent implementation: `trustgraph-flow/trustgraph/agent/react/`
- Agent schemas: `trustgraph-base/trustgraph/schema/services/agent.py`
- Extraction-time provenance notes: `docs/tech-specs/extraction-time-provenance.md`