From ec83775789838d6b0d200c118da7848dad816eee Mon Sep 17 00:00:00 2001
From: cybermaggedon
Date: Tue, 10 Mar 2026 10:07:37 +0000
Subject: [PATCH] Update tech spec (#678)

---
 docs/tech-specs/query-time-explainability.md | 263 +++++++++++++++++
 docs/tech-specs/query-time-provenance.md     | 282 -------------------
 2 files changed, 263 insertions(+), 282 deletions(-)
 create mode 100644 docs/tech-specs/query-time-explainability.md
 delete mode 100644 docs/tech-specs/query-time-provenance.md

diff --git a/docs/tech-specs/query-time-explainability.md b/docs/tech-specs/query-time-explainability.md
new file mode 100644
index 00000000..3385efc9
--- /dev/null
+++ b/docs/tech-specs/query-time-explainability.md
@@ -0,0 +1,263 @@
+# Query-Time Explainability
+
+## Status
+
+Implemented
+
+## Overview
+
+This specification describes how GraphRAG records and communicates explainability data during query execution. The goal is full traceability: from final answer back through selected edges to source documents.
+
+Query-time explainability captures what the GraphRAG pipeline did during reasoning. It connects to extraction-time provenance which records where knowledge graph facts originated.
+
+## Terminology
+
+| Term | Definition |
+|------|------------|
+| **Explainability** | The record of how a result was derived |
+| **Session** | A single GraphRAG query execution |
+| **Edge Selection** | LLM-driven selection of relevant edges with reasoning |
+| **Provenance Chain** | Path from edge → chunk → page → document |
+
+## Architecture
+
+### Explainability Flow
+
+```
+GraphRAG Query
+    │
+    ├─► Session Activity
+    │       └─► Query text, timestamp
+    │
+    ├─► Retrieval Entity
+    │       └─► All edges retrieved from subgraph
+    │
+    ├─► Selection Entity
+    │       └─► Selected edges with LLM reasoning
+    │           └─► Each edge links to extraction provenance
+    │
+    └─► Answer Entity
+            └─► Reference to synthesized response (in librarian)
+```
+
+### Two-Stage GraphRAG Pipeline
+
+1. **Edge Selection**: LLM selects relevant edges from subgraph, providing reasoning for each
+2. **Synthesis**: LLM generates answer from selected edges only
+
+This separation enables explainability - we know exactly which edges contributed.
+
+### Storage
+
+- Explainability triples stored in configurable collection (default: `explainability`)
+- Uses PROV-O ontology for provenance relationships
+- RDF-star reification for edge references
+- Answer content stored in librarian service (not inline - too large)
+
+### Real-Time Streaming
+
+Explainability events stream to client as the query executes:
+
+1. Session created → event emitted
+2. Edges retrieved → event emitted
+3. Edges selected with reasoning → event emitted
+4. Answer synthesized → event emitted
+
+Client receives `explain_id` and `explain_collection` to fetch full details.
+
+## URI Structure
+
+All URIs use the `urn:trustgraph:` namespace with UUIDs:
+
+| Entity | URI Pattern |
+|--------|-------------|
+| Session | `urn:trustgraph:session:{uuid}` |
+| Retrieval | `urn:trustgraph:prov:retrieval:{uuid}` |
+| Selection | `urn:trustgraph:prov:selection:{uuid}` |
+| Answer | `urn:trustgraph:prov:answer:{uuid}` |
+| Edge Selection | `urn:trustgraph:prov:edge:{uuid}:{index}` |
+
+## RDF Model (PROV-O)
+
+### Session Activity
+
+```turtle
+    a prov:Activity ;
+    rdfs:label "GraphRAG query session" ;
+    prov:startedAtTime "2024-01-15T10:30:00Z" ;
+    tg:query "What was the War on Terror?" .
+```
+
+### Retrieval Entity
+
+```turtle
+    a prov:Entity ;
+    rdfs:label "Retrieved edges" ;
+    prov:wasGeneratedBy ;
+    tg:edgeCount 50 .
+```
+
+### Selection Entity
+
+```turtle
+    a prov:Entity ;
+    rdfs:label "Selected edges" ;
+    prov:wasDerivedFrom ;
+    tg:selectedEdge ;
+    tg:selectedEdge .
+
+    tg:edge <<

+    >> ;
+    tg:reasoning "This edge establishes the key relationship..." .
+```
+
+### Answer Entity
+
+```turtle
+    a prov:Entity ;
+    rdfs:label "GraphRAG answer" ;
+    prov:wasDerivedFrom ;
+    tg:document .
+```
+
+The `tg:document` references the answer stored in the librarian service.
+
+## Namespace Constants
+
+Defined in `trustgraph-base/trustgraph/provenance/namespaces.py`:
+
+| Constant | URI |
+|----------|-----|
+| `TG_QUERY` | `https://trustgraph.ai/ns/query` |
+| `TG_EDGE_COUNT` | `https://trustgraph.ai/ns/edgeCount` |
+| `TG_SELECTED_EDGE` | `https://trustgraph.ai/ns/selectedEdge` |
+| `TG_EDGE` | `https://trustgraph.ai/ns/edge` |
+| `TG_REASONING` | `https://trustgraph.ai/ns/reasoning` |
+| `TG_CONTENT` | `https://trustgraph.ai/ns/content` |
+| `TG_DOCUMENT` | `https://trustgraph.ai/ns/document` |
+
+## GraphRagResponse Schema
+
+```python
+@dataclass
+class GraphRagResponse:
+    error: Error | None = None
+    response: str = ""
+    end_of_stream: bool = False
+    explain_id: str | None = None
+    explain_collection: str | None = None
+    message_type: str = ""  # "chunk" or "explain"
+    end_of_session: bool = False
+```
+
+### Message Types
+
+| message_type | Purpose |
+|--------------|---------|
+| `chunk` | Response text (streaming or final) |
+| `explain` | Explainability event with IRI reference |
+
+### Session Lifecycle
+
+1. Multiple `explain` messages (session, retrieval, selection, answer)
+2. Multiple `chunk` messages (streaming response)
+3. Final `chunk` with `end_of_session=True`
+
+## Edge Selection Format
+
+LLM returns JSONL with selected edges:
+
+```jsonl
+{"id": "edge-hash-1", "reasoning": "This edge shows the key relationship..."}
+{"id": "edge-hash-2", "reasoning": "Provides supporting evidence..."}
+```
+
+The `id` is a hash of `(labeled_s, labeled_p, labeled_o)` computed by `edge_id()`.
+
+## URI Preservation
+
+### The Problem
+
+GraphRAG displays human-readable labels to the LLM, but explainability needs original URIs for provenance tracing.
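
To make the relationship between edge ids and URIs concrete, here is a minimal sketch of hashing a labeled triple, keying the original URIs by that hash, and resolving an LLM's JSONL selection back to URIs. The names `edge_id_sketch`, `build_uri_map` and `parse_selection` are hypothetical, and the SHA-256 scheme is an assumption; the real `edge_id()` in the codebase may hash differently.

```python
import hashlib
import json


def edge_id_sketch(s: str, p: str, o: str) -> str:
    """Hypothetical stand-in for edge_id(): hash a labeled triple."""
    # Assumption: SHA-256 over a NUL-joined triple, truncated to 16 hex chars.
    key = "\x00".join((s, p, o)).encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def build_uri_map(edges):
    """Take (uri_triple, label_triple) pairs; return labels plus an id -> URIs map."""
    labeled_edges = []
    uri_map = {}
    for uris, labels in edges:
        labeled_edges.append(labels)
        uri_map[edge_id_sketch(*labels)] = uris
    return labeled_edges, uri_map


def parse_selection(jsonl: str, uri_map):
    """Parse JSONL selection output, keeping only ids present in uri_map."""
    selected = []
    for line in jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines the LLM did not emit as valid JSON
        if not isinstance(record, dict):
            continue
        edge = record.get("id")
        if isinstance(edge, str) and edge in uri_map:
            selected.append((uri_map[edge], record.get("reasoning", "")))
    return selected
```

Silently dropping unknown ids and malformed lines is a design choice for this sketch only; a real service might prefer to log or reject them.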
+
+### Solution
+
+`get_labelgraph()` returns both:
+- `labeled_edges`: List of `(label_s, label_p, label_o)` for LLM
+- `uri_map`: Dict mapping `edge_id(labels)` → `(uri_s, uri_p, uri_o)`
+
+When storing explainability data, URIs from `uri_map` are used.
+
+## Provenance Tracing
+
+### From Edge to Source
+
+Selected edges can be traced back to source documents:
+
+1. Query for reifying statement: `?stmt tg:reifies <>`
+2. Follow `prov:wasDerivedFrom` chain to root document
+3. Each step in chain: chunk → page → document
+
+### Cassandra Quoted Triple Support
+
+The Cassandra query service supports matching quoted triples:
+
+```python
+# In get_term_value():
+elif term.type == TRIPLE:
+    return serialize_triple(term.triple)
+```
+
+This enables queries like:
+```
+?stmt tg:reifies <>
+```
+
+## CLI Usage
+
+```bash
+tg-invoke-graph-rag --explainable -q "What was the War on Terror?"
+```
+
+### Output Format
+
+```
+[session] urn:trustgraph:session:abc123
+
+[retrieval] urn:trustgraph:prov:retrieval:abc123
+
+[selection] urn:trustgraph:prov:selection:abc123
+    Selected 12 edge(s)
+    Edge: (Guantanamo, definition, A detention facility...)
+      Reason: Directly connects Guantanamo to the War on Terror
+      Source: Chunk 1 → Page 2 → Beyond the Vigilant State
+
+[answer] urn:trustgraph:prov:answer:abc123
+
+Based on the provided knowledge statements...
+```
+
+### Features
+
+- Real-time explainability events during query
+- Label resolution for edge components via `rdfs:label`
+- Source chain tracing via `prov:wasDerivedFrom`
+- Label caching to avoid repeated queries
+
+## Files Implemented
+
+| File | Purpose |
+|------|---------|
+| `trustgraph-base/trustgraph/provenance/uris.py` | URI generators |
+| `trustgraph-base/trustgraph/provenance/namespaces.py` | RDF namespace constants |
+| `trustgraph-base/trustgraph/provenance/triples.py` | Triple builders |
+| `trustgraph-base/trustgraph/schema/services/retrieval.py` | GraphRagResponse schema |
+| `trustgraph-flow/trustgraph/retrieval/graph_rag/graph_rag.py` | Core GraphRAG with URI preservation |
+| `trustgraph-flow/trustgraph/retrieval/graph_rag/rag.py` | Service with librarian integration |
+| `trustgraph-flow/trustgraph/query/triples/cassandra/service.py` | Quoted triple query support |
+| `trustgraph-cli/trustgraph/cli/invoke_graph_rag.py` | CLI with explainability display |
+
+## References
+
+- PROV-O (W3C Provenance Ontology): https://www.w3.org/TR/prov-o/
+- RDF-star: https://w3c.github.io/rdf-star/
+- Extraction-time provenance: `docs/tech-specs/extraction-time-provenance.md`
diff --git a/docs/tech-specs/query-time-provenance.md b/docs/tech-specs/query-time-provenance.md
deleted file mode 100644
index 06bfaf70..00000000
--- a/docs/tech-specs/query-time-provenance.md
+++ /dev/null
@@ -1,282 +0,0 @@
-# Query-Time Provenance: Agent Explainability
-
-## Status
-
-Draft - Gathering Requirements
-
-## Overview
-
-This specification defines how the agent framework records and communicates provenance during query execution. The goal is full explainability: tracing how a result was obtained, from final answer back through reasoning steps to source data.
-
-Query-time provenance captures the "inference layer" - what the agent did during reasoning. It connects to extraction-time provenance (source layer) which records where facts came from originally.
-
-## Terminology
-
-| Term | Definition |
-|------|------------|
-| **Provenance** | The record of how a result was derived |
-| **Provenance Node** | A single step or artifact in the provenance DAG |
-| **Provenance DAG** | Directed Acyclic Graph of provenance relationships |
-| **Query-time Provenance** | Provenance generated during agent reasoning |
-| **Extraction-time Provenance** | Provenance from data ingestion (source metadata) - separate spec |
-
-## Architecture
-
-### Two Provenance Contexts
-
-1. **Extraction-time** (out of scope for this spec):
-   - Generated when data is ingested (PDF extraction, web scraping, etc.)
-   - Records: source URL, extraction method, timestamps, funding, authorship
-   - Already partially implemented via source metadata in knowledge graph
-   - See: `docs/tech-specs/extraction-time-provenance.md` (notes)
-
-2. **Query-time** (this spec):
-   - Generated during agent reasoning
-   - Records: tool invocations, retrieval results, LLM reasoning, final conclusions
-   - Links to extraction-time provenance for retrieved facts
-
-### Provenance Flow
-
-```
-Agent Session
-    │
-    ├─► Tool: Knowledge Query
-    │       │
-    │       ├─► Retrieved Fact A ──► [link to extraction provenance]
-    │       └─► Retrieved Fact B ──► [link to extraction provenance]
-    │
-    ├─► LLM Reasoning Step
-    │       │
-    │       └─► "Combined A and B to conclude X"
-    │
-    └─► Final Answer
-            │
-            └─► Derived from reasoning step above
-```
-
-### Storage
-
-- Provenance stored in knowledge graph infrastructure
-- Segregated in a **separate collection** for distinct retrieval patterns
-- Query-time provenance references extraction-time provenance nodes via IRIs
-- Persists beyond agent session (reusable, auditable)
-
-### Real-Time Streaming
-
-Provenance events stream back to the client as the agent works:
-
-1. Agent invokes tool
-2. Tool generates provenance data
-3. Provenance stored in graph
-4. Provenance event sent to client
-5. UX builds provenance visualization incrementally
-
-## Provenance Node Structure
-
-Each provenance node represents a step in the reasoning process.
-
-### Node Identity
-
-Provenance nodes are identified by IRIs containing UUIDs, consistent with the RDF-style knowledge graph:
-
-```
-urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440000
-```
-
-### Core Fields
-
-| Field | Description |
-|-------|-------------|
-| `id` | IRI with UUID (e.g., `urn:trustgraph:prov:{uuid}`) |
-| `session_id` | Agent session this belongs to |
-| `timestamp` | When this step occurred |
-| `type` | Node type (see below) |
-| `derived_from` | List of parent node IRIs (DAG edges) |
-
-### Node Types
-
-| Type | Description | Additional Fields |
-|------|-------------|-------------------|
-| `retrieval` | Facts retrieved from knowledge graph | `facts`, `source_refs` |
-| `tool_invocation` | Tool was called | `tool_name`, `input`, `output` |
-| `reasoning` | LLM reasoning step | `prompt_summary`, `conclusion` |
-| `answer` | Final answer produced | `content` |
-
-### Example Provenance Nodes
-
-```json
-{
-  "id": "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440001",
-  "session_id": "urn:trustgraph:session:7c9e6679-7425-40de-944b-e07fc1f90ae7",
-  "timestamp": "2024-01-15T10:30:00Z",
-  "type": "retrieval",
-  "derived_from": [],
-  "facts": [
-    {
-      "id": "urn:trustgraph:fact:9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
-      "content": "Swallow airspeed is 8.5 m/s"
-    }
-  ],
-  "source_refs": ["urn:trustgraph:extract:1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed"]
-}
-```
-
-```json
-{
-  "id": "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440002",
-  "session_id": "urn:trustgraph:session:7c9e6679-7425-40de-944b-e07fc1f90ae7",
-  "timestamp": "2024-01-15T10:30:01Z",
-  "type": "reasoning",
-  "derived_from": ["urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440001"],
-  "prompt_summary": "Asked to determine average swallow speed",
-  "conclusion": "Based on retrieved data, average speed is 8.5 m/s"
-}
-```
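
The `derived_from` field in the example nodes is what links them into a DAG. As an illustration only, the sketch below walks those links to collect every ancestor of a node, assuming the nodes have already been fetched into an in-memory dict keyed by `id`. The `ancestors` helper and that dict are hypothetical and not part of the draft.

```python
from collections import deque


def ancestors(node_id, nodes):
    """Return IRIs of all transitive parents of node_id, nearest first."""
    order = []    # parents in breadth-first discovery order
    seen = set()  # guards against visiting a shared parent twice
    queue = deque(nodes.get(node_id, {}).get("derived_from", []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue
        seen.add(parent)
        order.append(parent)
        # Parents missing from `nodes` are still reported, just not expanded.
        queue.extend(nodes.get(parent, {}).get("derived_from", []))
    return order
```

Because a node may have several parents, the `seen` set matters: without it, a fact reached through two reasoning paths would be listed twice.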
-
-## Provenance Events
-
-Events streamed to the client during agent execution.
-
-### Design: Lightweight Reference Events
-
-Provenance events are lightweight - they reference provenance nodes by IRI rather than embedding full provenance data. This keeps the stream efficient while allowing the client to fetch full details if needed.
-
-A single agent step may create or modify multiple provenance objects. The event references all of them.
-
-### Event Structure
-
-```json
-{
-  "provenance_refs": [
-    "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440001",
-    "urn:trustgraph:prov:550e8400-e29b-41d4-a716-446655440002"
-  ]
-}
-```
-
-### Integration with Agent Response
-
-Provenance events extend `AgentResponse` with a new `chunk_type: "provenance"`:
-
-```json
-{
-  "chunk_type": "provenance",
-  "content": "",
-  "provenance_refs": ["urn:trustgraph:prov:..."],
-  "end_of_message": false
-}
-```
-
-This allows provenance updates to flow alongside existing chunk types (`thought`, `observation`, `answer`, `error`).
-
-## Tool Provenance Reporting
-
-Tools report provenance as part of their execution.
-
-### Minimum Reporting (all tools)
-
-Every tool can report at minimum:
-- Tool name
-- Input arguments
-- Output result
-
-### Enhanced Reporting (tools that can describe more)
-
-Tools that understand their internals can report:
-- What sources were consulted
-- What reasoning/transformation was applied
-- Confidence scores
-- Links to extraction-time provenance
-
-### Graceful Degradation
-
-Tools that can't provide detailed provenance still participate:
-```json
-{
-  "type": "tool_invocation",
-  "tool_name": "calculator",
-  "input": {"expression": "8 + 5"},
-  "output": "13",
-  "detail_level": "basic"
-}
-```
-
-## Design Decisions
-
-### Provenance Node Identity: IRIs with UUIDs
-
-Provenance nodes use IRIs containing UUIDs, consistent with the RDF-style knowledge graph:
-- Format: `urn:trustgraph:prov:{uuid}`
-- Globally unique, persistent across sessions
-- Can be dereferenced to retrieve full node data
-
-### Storage Segregation: Separate Collection
-
-Provenance is stored in a separate collection within the knowledge graph infrastructure. This allows:
-- Distinct retrieval patterns for provenance vs. data
-- Independent scaling/retention policies
-- Clear separation of concerns
-
-### Client Protocol: Extended AgentResponse
-
-Provenance events extend `AgentResponse` with `chunk_type: "provenance"`. Events are lightweight, containing only IRI references to provenance nodes created/modified in the step.
-
-### Retrieval Granularity: Flexible, Multiple Objects Per Step
-
-A single agent step can create multiple provenance objects. The provenance event references all objects created or modified. This handles cases like:
-- Retrieval returning multiple facts (each gets a provenance node)
-- Tool invocation creating both an invocation node and result nodes
-
-### Graph Structure: True DAG
-
-The provenance structure is a DAG (not a tree):
-- A provenance node can have multiple parents (e.g., reasoning combines facts A and B)
-- Extraction-time nodes can be referenced by multiple query-time sessions
-- Enables proper modeling of how conclusions derive from multiple sources
-
-### Linking to Extraction Provenance: Direct IRI Reference
-
-Query-time provenance references extraction-time provenance via direct IRI links in the `source_refs` field. No separate linking mechanism needed.
-
-## Open Questions
-
-### Provenance Retrieval API
-
-Base layer uses the existing knowledge graph API to query the provenance collection. A higher-level service may be added to provide convenience methods. Details TBD during implementation.
-
-### Provenance Node Granularity
-
-Placeholder to explore: What level of detail should different node types capture?
-- Should `reasoning` nodes include the full LLM prompt, or just a summary?
-- How much of tool input/output to store?
-- Trade-offs between completeness and storage/performance
-
-### Provenance Retention
-
-TBD - retention policy to be determined:
-- Indefinitely?
-- Tied to session retention?
-- Configurable per collection?
-
-## Implementation Considerations
-
-### Files Likely Affected
-
-| Area | Changes |
-|------|---------|
-| Agent service | Generate provenance events |
-| Tool implementations | Report provenance data |
-| Agent response schema | Add provenance event type |
-| Knowledge graph | Provenance storage/retrieval |
-
-### Backward Compatibility
-
-- Existing agent clients continue to work (provenance is additive)
-- Tools that don't report provenance still function
-
-## References
-
-- PROV-O (PROV-Ontology): W3C standard for provenance modeling
-- Current agent implementation: `trustgraph-flow/trustgraph/agent/react/`
-- Agent schemas: `trustgraph-base/trustgraph/schema/services/agent.py`
-- Extraction-time provenance notes: `docs/tech-specs/extraction-time-provenance.md`