Commit graph

1051 commits

Author SHA1 Message Date
Cyber MacGeddon
f02bbdb442 New CLA workflow: Uses a github action in
trustgraph-ai/contributor-license-agreement

This blocks a PR until the commiter responds with a message
of agreement with the CLA terms.
2026-03-26 14:09:07 +00:00
cybermaggedon
4164ef1c47
Add GATEWAY_SECRET support for MCP server to API gateway auth (#721)
Pass bearer token from GATEWAY_SECRET environment variable as a
URL query parameter on websocket connections to the API gateway.
When unset or empty, no auth is applied (backwards compatible).
2026-03-26 10:49:28 +00:00
cybermaggedon
97f5645ea0
CLA (#716)
Explanatory text for the CLA process
2026-03-26 09:08:09 +00:00
cybermaggedon
1f67fc2312
master -> release/v2.2 (#713)
Merge doc updates from master into release branch
2026-03-25 17:53:20 +00:00
cybermaggedon
9330730afb
Add chunk content ID to explain trace provenance output (#708)
When --show-provenance is used with tg-show-explain-trace,
display the chunk URI on a Content: line below each Source:
chain. This allows the user to easily fetch the source text
with tg-get-document-content.
2026-03-23 16:20:52 +00:00
cybermaggedon
25995d03f4
Fix stray log messages caused by librarian messages (#706)
Warning generated by librarian responses meant for other
services (chunker, embeddings, etc.) arriving on the shared
response queue. The decoder's subscription picks them up, can't
match them to a pending request, and logs a warning.

Removed the warnings, as not serving a purpose.
2026-03-23 13:16:39 +00:00
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
4609424afe
Prepare 2.2 release branch (#704) 2026-03-22 15:23:23 +00:00
cybermaggedon
96fd1eab15
Use UUID-based URNs for page and chunk IDs (#703)
Page and chunk document IDs were deterministic ({doc_id}/p{num},
{doc_id}/p{num}/c{num}), causing "Document already exists" errors
when reprocessing documents through different flows. Content may
differ between runs due to different parameters or extractors, so
deterministic IDs are incorrect.

Pages now use urn:page:{uuid}, chunks use
urn:chunk:{uuid}. Parent- child relationships are tracked via
librarian metadata and provenance triples.

Also brings Mistral OCR and Tesseract OCR decoders up to parity
with the PDF decoder: librarian fetch/save support, per-page
output with unique IDs, and provenance triple emission. Fixes
Mistral OCR bug where only the first 5 pages were processed.
2026-03-21 21:17:03 +00:00
cybermaggedon
1a7b654bd3
Add semantic pre-filter for GraphRAG edge scoring (#702)
Embed edge descriptions and compute cosine similarity against grounding
concepts to reduce the number of edges sent to expensive LLM scoring.
Controlled by edge_score_limit parameter (default 30), skipped when edge
count is already below the limit.

Also plumbs edge_score_limit and edge_limit parameters end-to-end:
- CLI args (--edge-score-limit, --edge-limit) in both invoke and service
- Socket client: fix parameter mapping to use hyphenated wire-format keys
- Flow API, message translator, gateway all pass through correctly
- Explainable code path (_question_explainable_api) now forwards all params
- Default edge_score_limit changed from 50 to 30 based on typical subgraph
  sizes
2026-03-21 20:06:29 +00:00
cybermaggedon
bc68738c37
README.md from master (#701) 2026-03-17 20:54:04 +00:00
cybermaggedon
664d1d0384
Update API specs for 2.1 (#699)
* Updating API specs for 2.1

* Updated API and SDK docs
2026-03-17 20:36:31 +00:00
cybermaggedon
c387670944
Fix incorrect property names in explainability (#698)
Remove type suffixes from explainability dataclass fields + fix show_explain_trace

Rename dataclass fields to match KG property naming conventions:
- Analysis: thought_uri/observation_uri → thought/observation
- Synthesis/Conclusion/Reflection: document_uri → document

Fix show_explain_trace for current API:
- Resolve document content via librarian fetch instead of removed
  inline content fields (synthesis.content, conclusion.answer)
- Add Grounding display for DocRAG traces
- Update fetch_docrag_trace chain: Question → Grounding → Exploration →
Synthesis
- Pass api/explain_client to all print functions for content resolution

Update all CLI tools and tests for renamed fields.
2026-03-16 14:47:37 +00:00
cybermaggedon
a115ec06ab
Enhance retrieval pipelines: 4-stage GraphRAG, DocRAG grounding (#697)
Enhance retrieval pipelines: 4-stage GraphRAG, DocRAG grounding,
consistent PROV-O

GraphRAG:
- Split retrieval into 4 prompt stages: extract-concepts,
  kg-edge-scoring,
  kg-edge-reasoning, kg-synthesis (was single-stage)
- Add concept extraction (grounding) for per-concept embedding
- Filter main query to default graph, ignoring
  provenance/explainability edges
- Add source document edges to knowledge graph

DocumentRAG:
- Add grounding step with concept extraction, matching GraphRAG's
  pattern:
  Question → Grounding → Exploration → Synthesis
- Per-concept embedding and chunk retrieval with deduplication

Cross-pipeline:
- Make PROV-O derivation links consistent: wasGeneratedBy for first
  entity from Activity, wasDerivedFrom for entity-to-entity chains
- Update CLIs (tg-invoke-agent, tg-invoke-graph-rag,
  tg-invoke-document-rag) for new explainability structure
- Fix all affected unit and integration tests
2026-03-16 12:12:13 +00:00
cybermaggedon
29b4300808
Updated test suite for explainability & provenance (#696)
* Provenance tests

* Embeddings tests

* Test librarian

* Test triples stream

* Test concurrency

* Entity centric graph writes

* Agent tool service tests

* Structured data tests

* RDF tests

* Addition LLM tests

* Reliability tests
2026-03-13 14:27:42 +00:00
cybermaggedon
e6623fc915
Remove schema:subjectOf edges from KG extraction (#695)
The subjectOf triples were redundant with the subgraph provenance model
introduced in e8407b34. Entity-to-source lineage can be traced via
tg:contains -> subgraph -> prov:wasDerivedFrom -> chunk, making the
direct subjectOf edges unnecessary metadata polluting the knowledge graph.

Removed from all three extractors (agent, definitions, relationships),
cleaned up the SUBJECT_OF constant and vocabulary label, and updated
tests accordingly.
2026-03-13 12:11:21 +00:00
cybermaggedon
64e3f6bd0d
Subgraph provenance (#694)
Replace per-triple provenance reification with subgraph model

Extraction provenance previously created a full reification (statement
URI, activity, agent) for every single extracted triple, producing ~13
provenance triples per knowledge triple.  Since each chunk is processed
by a single LLM call, this was both redundant and semantically
inaccurate.

Now one subgraph object is created per chunk extraction, with
tg:contains linking to each extracted triple.  For 20 extractions from
a chunk this reduces provenance from ~260 triples to ~33.

- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to
  generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent
  (previously had none)
- Update CLI tools and tech specs to match

Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.

Added extra typing for extraction provenance, fixed extraction prov CLI
2026-03-13 11:37:59 +00:00
cybermaggedon
35128ff019
Add unified explainability support and librarian storage for (#693)
Add unified explainability support and librarian storage for all retrieval engines

Implements consistent explainability/provenance tracking
across GraphRAG, DocumentRAG, and Agent retrieval
engines. All large content (answers, thoughts, observations)
is now stored in librarian rather than as inline literals in
the knowledge graph.

Explainability API:
- New explainability.py module with entity classes (Question,
  Exploration, Focus, Synthesis, Analysis, Conclusion) and
  ExplainabilityClient
- Quiescence-based eventual consistency handling for trace
  fetching
- Content fetching from librarian with retry logic

CLI updates:
- tg-invoke-graph-rag -x/--explainable flag returns
  explain_id
- tg-invoke-document-rag -x/--explainable flag returns
  explain_id
- tg-invoke-agent -x/--explainable flag returns explain_id
- tg-list-explain-traces uses new explainability API
- tg-show-explain-trace handles all three trace types

Agent provenance:
- Records session, iterations (think/act/observe), and conclusion
- Stores thoughts and observations in librarian with document
  references
- New predicates: tg:thoughtDocument, tg:observationDocument

DocumentRAG provenance:
- Records question, exploration (chunk retrieval), and synthesis
- Stores answers in librarian with document references

Schema changes:
- AgentResponse: added explain_id, explain_graph fields
- RetrievalResponse: added explain_id, explain_graph fields
- agent_iteration_triples: supports thought_document_id,
  observation_document_id

Update tests.
2026-03-12 21:40:09 +00:00
cybermaggedon
aecf00f040
Minor agent tweaks (#692)
Update RAG and Agent clients for streaming message handling

GraphRAG now sends multiple message types in a stream:
- 'explain' messages with explain_id and explain_graph for
  provenance
- 'chunk' messages with response text fragments
- end_of_session marker for stream completion

Updated all clients to handle this properly:

CLI clients (trustgraph-base/trustgraph/clients/):
- graph_rag_client.py: Added chunk_callback and explain_callback
- document_rag_client.py: Added chunk_callback and explain_callback
- agent_client.py: Added think, observe, answer_callback,
  error_callback

Internal clients (trustgraph-base/trustgraph/base/):
- graph_rag_client.py: Async callbacks for streaming
- agent_client.py: Async callbacks for streaming

All clients now:
- Route messages by chunk_type/message_type
- Stream via optional callbacks for incremental delivery
- Wait for proper completion signals
(end_of_dialog/end_of_session/end_of_stream)
- Accumulate and return complete response for callers not using
  callbacks

Updated callers:
- extract/kg/agent/extract.py: Uses new invoke(question=...) API
- tests/integration/test_agent_kg_extraction_integration.py:
  Updated mocks

This fixes the agent infinite loop issue where knowledge_query was
returning the first 'explain' message (empty response) instead of
waiting for the actual answer chunks.

Concurrency in triples query
2026-03-12 17:59:02 +00:00
cybermaggedon
45e6ad4abc
Fix ontology RAG pipeline + add query concurrency (#691)
- Fix ontology RAG pipeline: embeddings API, chunker provenance, and query concurrency

- Fix ontology embeddings to use correct response shape from embed()
  API (returns list of vectors, not list of list of vectors).
- Simplify chunker URI logic to append /c{index} to parent ID
  instead of parsing page/doc URI structure which was fragile.

- Add provenance tracking and librarian integration to token
  chunker, matching recursive chunker capabilities.

- Add configurable concurrency (default 10) to Cassandra, Qdrant,
  and embeddings query services.
2026-03-12 11:34:42 +00:00
cybermaggedon
312174eb88
Adding explainability to the ReACT agent (#689)
* Added tech spec

* Add provenance recording to React agent loop

Enables agent sessions to be traced and debugged using the same
explainability infrastructure as GraphRAG. Agent traces record:
- Session start with query and timestamp
- Each iteration's thought, action, arguments, and observation
- Final answer with derivation chain

Changes:
- Add session_id and collection fields to AgentRequest schema
- Add agent predicates (TG_THOUGHT, TG_ACTION, etc.) to namespaces
- Create agent provenance triple generators in provenance/agent.py
- Register explainability producer in agent service
- Emit provenance triples during agent execution
- Update CLI tools to detect and render agent traces alongside GraphRAG

* Updated explainability taxonomy:

GraphRAG: tg:Question → tg:Exploration → tg:Focus → tg:Synthesis

Agent: tg:Question → tg:Analysis(s) → tg:Conclusion

All entities also have their PROV-O type (prov:Activity or prov:Entity).

Updated commit message:

Add provenance recording to React agent loop

Enables agent sessions to be traced and debugged using the same
explainability infrastructure as GraphRAG.

Entity types follow human reasoning patterns:
- tg:Question - the user's query (shared with GraphRAG)
- tg:Analysis - each think/act/observe cycle
- tg:Conclusion - the final answer

Also adds explicit TG types to GraphRAG entities:
- tg:Question, tg:Exploration, tg:Focus, tg:Synthesis

All types retain their PROV-O base types (prov:Activity, prov:Entity).

Changes:
- Add session_id and collection fields to AgentRequest schema
- Add explainability entity types to namespaces.py
- Create agent provenance triple generators
- Register explainability producer in agent service
- Emit provenance triples during agent execution
- Update CLI tools to detect and render both trace types

* Document RAG explainability is now complete. Here's a summary of the
changes made:

Schema Changes:
- trustgraph-base/trustgraph/schema/services/retrieval.py: Added
  explain_id and explain_graph fields to DocumentRagResponse
- trustgraph-base/trustgraph/messaging/translators/retrieval.py:
  Updated translator to handle explainability fields

Provenance Changes:
- trustgraph-base/trustgraph/provenance/namespaces.py: Added
  TG_CHUNK_COUNT and TG_SELECTED_CHUNK predicates
- trustgraph-base/trustgraph/provenance/uris.py: Added
  docrag_question_uri, docrag_exploration_uri, docrag_synthesis_uri
  generators
- trustgraph-base/trustgraph/provenance/triples.py: Added
  docrag_question_triples, docrag_exploration_triples,
  docrag_synthesis_triples builders
- trustgraph-base/trustgraph/provenance/__init__.py: Exported all
  new Document RAG functions and predicates

Service Changes:
- trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py:
  Added explainability callback support and triple emission at each
  phase (Question → Exploration → Synthesis)
- trustgraph-flow/trustgraph/retrieval/document_rag/rag.py:
  Registered explainability producer and wired up the callback

Documentation:
- docs/tech-specs/agent-explainability.md: Added Document RAG entity
  types and provenance model documentation

Document RAG Provenance Model:
Question (urn:trustgraph:docrag:{uuid})
    │
    │  tg:query, prov:startedAtTime
    │  rdf:type = prov:Activity, tg:Question
    │
    ↓ prov:wasGeneratedBy
    │
Exploration (urn:trustgraph:docrag:{uuid}/exploration)
    │
    │  tg:chunkCount, tg:selectedChunk (multiple)
    │  rdf:type = prov:Entity, tg:Exploration
    │
    ↓ prov:wasDerivedFrom
    │
Synthesis (urn:trustgraph:docrag:{uuid}/synthesis)
    │
    │  tg:content = "The answer..."
    │  rdf:type = prov:Entity, tg:Synthesis

* Specific subtype that makes the retrieval mechanism immediately
obvious:

System: GraphRAG
TG Types on Question: tg:Question, tg:GraphRagQuestion
URI Pattern: urn:trustgraph:question:{uuid}
────────────────────────────────────────
System: Document RAG
TG Types on Question: tg:Question, tg:DocRagQuestion
URI Pattern: urn:trustgraph:docrag:{uuid}
────────────────────────────────────────
System: Agent
TG Types on Question: tg:Question, tg:AgentQuestion
URI Pattern: urn:trustgraph:agent:{uuid}
Files modified:
- trustgraph-base/trustgraph/provenance/namespaces.py - Added
TG_GRAPH_RAG_QUESTION, TG_DOC_RAG_QUESTION, TG_AGENT_QUESTION
- trustgraph-base/trustgraph/provenance/triples.py - Added subtype to
question_triples and docrag_question_triples
- trustgraph-base/trustgraph/provenance/agent.py - Added subtype to
agent_session_triples
- trustgraph-base/trustgraph/provenance/__init__.py - Exported new types
- docs/tech-specs/agent-explainability.md - Documented the subtypes

This allows:
- Query all questions: ?q rdf:type tg:Question
- Query only GraphRAG: ?q rdf:type tg:GraphRagQuestion
- Query only Document RAG: ?q rdf:type tg:DocRagQuestion
- Query only Agent: ?q rdf:type tg:AgentQuestion

* Fixed tests
2026-03-11 15:28:15 +00:00
cybermaggedon
a53ed41da2
Add explainability CLI tools (#688)
Add explainability CLI tools for debugging provenance data
- tg-show-document-hierarchy: Display document → page → chunk → edge
  hierarchy by traversing prov:wasDerivedFrom relationships
- tg-list-explain-traces: List all GraphRAG sessions with questions
  and timestamps from the retrieval graph
- tg-show-explain-trace: Show full explainability cascade for a
  GraphRAG session (question → exploration → focus → synthesis)

These tools query the source and retrieval graphs to help debug
and explore provenance/explainability data stored during document
processing and GraphRAG queries.
2026-03-11 13:44:29 +00:00
cybermaggedon
fda508fdae
Lock mistralai to <2.0.0, avoids a breaking change (#687) 2026-03-11 12:19:04 +00:00
cybermaggedon
286f762369
The id field in pipeline Metadata was being overwritten at each processing (#686)
The id field in pipeline Metadata was being overwritten at each processing
stage (document → page → chunk), causing knowledge storage to create
separate cores per chunk instead of grouping by document.

Add a root field that:
- Is set by librarian to the original document ID
- Is copied unchanged through PDF decoder, chunkers, and extractors
- Is used by knowledge storage for document_id grouping (with fallback to id)

Changes:
- Add root field to Metadata schema with empty string default
- Set root=document.id in librarian when initiating document processing
- Copy root through PDF decoder, recursive chunker, and all extractors
- Update knowledge storage to use root (or id as fallback) for grouping
- Add root handling to translators and gateway serialization
- Update test mock Metadata class to include root parameter
2026-03-11 12:16:39 +00:00
cybermaggedon
aa4f5c6c00
Remove redundant metadata (#685)
The metadata field (list of triples) in the pipeline Metadata class
was redundant. Document metadata triples already flow directly from
librarian to triple-store via emit_document_provenance() - they don't
need to pass through the extraction pipeline.

Additionally, chunker and PDF decoder were overwriting metadata to []
anyway, so any metadata passed through the pipeline was being
discarded.

Changes:
- Remove metadata field from Metadata dataclass
  (schema/core/metadata.py)
- Update all Metadata instantiations to remove metadata=[]
  parameter
- Remove metadata handling from translators (document_loading,
  knowledge)
- Remove metadata consumption from extractors (ontology, agent)
- Update gateway serializers and import handlers
- Update all unit, integration, and contract tests
2026-03-11 10:51:39 +00:00
cybermaggedon
1837d73f34
Dataflow tech spec for extraction (#684) 2026-03-11 10:49:50 +00:00
cybermaggedon
ca525b679f
Filter answers from document lists (#683) 2026-03-10 17:22:19 +00:00
cybermaggedon
e1bc4c04a4
Terminology Rename, and named-graphs for explainability (#682)
Terminology Rename, and named-graphs for explainability data

Changed terminology:
  - session -> question
  - retrieval -> exploration
  - selection -> focus
  - answer -> synthesis

- uris.py: Renamed query_session_uri → question_uri,
  retrieval_uri → exploration_uri, selection_uri → focus_uri,
  answer_uri → synthesis_uri
- triples.py: Renamed corresponding triple generation functions with
  updated labels ("GraphRAG question", "Exploration", "Focus",
  "Synthesis")
- namespaces.py: Added named graph constants GRAPH_DEFAULT,
  GRAPH_SOURCE, GRAPH_RETRIEVAL
- init.py: Updated exports
- graph_rag.py: Updated to use new terminology
- invoke_graph_rag.py: Updated CLI to display new stage names
  (Question, Exploration, Focus, Synthesis)

Query-Time Explainability → Named Graph
- triples.py: Added set_graph() helper function to set named graph
  on triples
- graph_rag.py: All explainability triples now use GRAPH_RETRIEVAL
  named graph
- rag.py: Explainability triples stored in user's collection (not
  separate collection) with named graph

Extraction Provenance → Named Graph
- relationships/extract.py: Provenance triples use GRAPH_SOURCE
  named graph
- definitions/extract.py: Provenance triples use GRAPH_SOURCE
  named graph
- chunker.py: Provenance triples use GRAPH_SOURCE named graph
- pdf_decoder.py: Provenance triples use GRAPH_SOURCE named graph

CLI Updates
- show_graph.py: Added -g/--graph option to filter by named graph and
  --show-graph to display graph column

Also:
- Fix knowledge core schemas
2026-03-10 14:35:21 +00:00
cybermaggedon
57eda65674
Knowledge core processing updated for embeddings interface change (#681)
Knowledge core fixed: 
- trustgraph-flow/trustgraph/tables/knowledge.py - v.vector, v.chunk_id
- trustgraph-base/trustgraph/messaging/translators/document_loading.py -
  chunk.vector
- trustgraph-base/trustgraph/messaging/translators/knowledge.py -
  entity.vector
- trustgraph-flow/trustgraph/gateway/dispatch/serialize.py - entity.vector,
  chunk.vector

Test fixtures fixed:
- tests/unit/test_storage/conftest.py - All mock entities/chunks use vector
- tests/unit/test_query/conftest.py - All mock requests use vector
- tests/unit/test_query/test_doc_embeddings_pinecone_query.py - All mock
  messages use vector

These changes align with commit f2ae0e86 which changed the schema from
vectors: list[list[float]] to vector: list[float].
2026-03-10 13:28:16 +00:00
cybermaggedon
84941ce645
Fix Cassandra schema and graph filter semantics (#680)
Schema fix (dtype/lang clustering key):
- Add dtype and lang to PRIMARY KEY in quads_by_entity table
- Add otype, dtype, lang to PRIMARY KEY in quads_by_collection table
- Fixes deduplication bug where literals with same value but different
  datatype or language tag were collapsed (e.g., "thing" vs "thing"@en)
- Update delete_collection to pass new clustering columns
- Update tech spec to reflect new schema

Graph filter semantics (simplified, no wildcard constant):
- g=None means all graphs (no filter)
- g="" means default graph only
- g="uri" means specific named graph
- Remove GRAPH_WILDCARD usage from EntityCentricKnowledgeGraph
- Fix service.py streaming and non-streaming paths
- Fix CLI to preserve empty string for -g '' argument
2026-03-10 12:52:51 +00:00
cybermaggedon
c951562189
Graph query CLI tool (#679)
New CLI tool that enables selective queries against the triple store
unlike tg-show-graph which dumps the entire graph.

Features:
- Filter by subject, predicate, object, and/or named graph
- Auto-detection of term types (IRI, literal, quoted triple)
- Two ways to specify quoted triples:
  - Inline Turtle-style: -o "<<s p o>>"
  - Explicit flags: --qt-subject, --qt-predicate, --qt-object
- Output formats: space-separated, pipe-separated, JSON, JSON Lines
- Streaming mode for efficient large result sets

Auto-detection rules:
- http://, https://, urn:, or <wrapped> -> IRI
- <<s p o>> -> quoted triple
- Otherwise -> literal
2026-03-10 11:03:34 +00:00
cybermaggedon
ec83775789
Update tech spec (#678) 2026-03-10 10:07:37 +00:00
cybermaggedon
7a6197d8c3
GraphRAG Query-Time Explainability (#677)
Implements full explainability pipeline for GraphRAG queries, enabling
traceability from answers back to source documents.

Renamed throughout for clarity:
- provenance_callback → explain_callback
- provenance_id → explain_id
- provenance_collection → explain_collection
- message_type "provenance" → "explain"
- Queue name "provenance" → "explainability"

GraphRAG queries now emit explainability events as they execute:
1. Session - query text and timestamp
2. Retrieval - edges retrieved from subgraph
3. Selection - selected edges with LLM reasoning (JSONL with id +
   reasoning)
4. Answer - reference to synthesized response

Events stream via explain_callback during query(), enabling
real-time UX.

- Answers stored in librarian service (not inline in graph - too large)
- Document ID as URN: urn:trustgraph:answer:{session_id}
- Graph stores tg:document reference (IRI) to librarian document
- Added librarian producer/consumer to graph-rag service

- get_labelgraph() now returns (labeled_edges, uri_map)
- uri_map maps edge_id(label_s, label_p, label_o) →
  (uri_s, uri_p, uri_o)
- Explainability data stores original URIs, not labels
- Enables tracing edges back to reifying statements via tg:reifies

- Added serialize_triple() to query service (matches storage format)
- get_term_value() now handles TRIPLE type terms
- Enables querying by quoted triple in object position:
  ?stmt tg:reifies <<s p o>>

- Displays real-time explainability events during query
- Resolves rdfs:label for edge components (s, p, o)
- Traces source chain via prov:wasDerivedFrom to root document
- Output: "Source: Chunk 1 → Page 2 → Document Title"
- Label caching to avoid repeated queries

GraphRagResponse:
- explain_id: str | None
- explain_collection: str | None
- message_type: str ("chunk" or "explain")
- end_of_session: bool

trustgraph-base/trustgraph/provenance/:
- namespaces.py - Added TG_DOCUMENT predicate
- triples.py - answer_triples() supports document_id reference
- uris.py - Added edge_selection_uri()

trustgraph-base/trustgraph/schema/services/retrieval.py:
- GraphRagResponse with explain_id, explain_collection, end_of_session

trustgraph-flow/trustgraph/retrieval/graph_rag/:
- graph_rag.py - URI preservation, streaming answer accumulation
- rag.py - Librarian integration, real-time explain emission

trustgraph-flow/trustgraph/query/triples/cassandra/service.py:
- Quoted triple serialization for query matching

trustgraph-cli/trustgraph/cli/invoke_graph_rag.py:
- Full explainability display with label resolution and source tracing
2026-03-10 10:00:01 +00:00
cybermaggedon
d2d71f859d
Feature/streaming triples (#676)
* Steaming triples

* Also GraphRAG service uses this

* Updated tests
2026-03-09 15:46:33 +00:00
cybermaggedon
3c3e11bef5
Fix/librarian broken (#674)
* Set end-of-stream cleanly - clean streaming message structures

* Add tg-get-document-content
2026-03-09 13:36:24 +00:00
cybermaggedon
df1808768d
Fix/doc streaming proto (#673)
* Librarian streaming doc download

* Document stream download endpoint
2026-03-09 12:36:10 +00:00
cybermaggedon
b2ef7bbb8c
Fix doc embeddings invocation (#672)
* Fix doc embeddings invocation

* Tidy query embeddings invocation
2026-03-09 11:07:32 +00:00
cybermaggedon
f2ae0e8623
Embeddings API scores (#671)
- Put scores in all responses
- Remove unused 'middle' vector layer. Vector of texts -> vector of (vector embedding)
2026-03-09 10:53:44 +00:00
cybermaggedon
4fa7cc7d7c
Fix/embeddings integration 2 (#670) 2026-03-08 19:42:26 +00:00
cybermaggedon
919b760c05
Update embeddings integration for new batch embeddings interfaces (#669)
* Fix vector extraction

* Fix embeddings integration
2026-03-08 19:41:52 +00:00
cybermaggedon
0a2ce47a88
Batch embeddings (#668)
Base Service (trustgraph-base/trustgraph/base/embeddings_service.py):
- Changed on_request to use request.texts

FastEmbed Processor
(trustgraph-flow/trustgraph/embeddings/fastembed/processor.py):
- on_embeddings(texts, model=None) now processes full batch efficiently
- Returns [[v.tolist()] for v in vecs] - list of vector sets

Ollama Processor (trustgraph-flow/trustgraph/embeddings/ollama/processor.py):
- on_embeddings(texts, model=None) passes list directly to Ollama
- Returns [[embedding] for embedding in embeds.embeddings]

EmbeddingsClient (trustgraph-base/trustgraph/base/embeddings_client.py):
- embed(texts, timeout=300) accepts list of texts

Tests Updated:
- test_fastembed_dynamic_model.py - 4 tests updated for new interface
- test_ollama_dynamic_model.py - 4 tests updated for new interface

Updated CLI, SDK and APIs
2026-03-08 18:36:54 +00:00
cybermaggedon
3bf8a65409
Fix tests (#666) 2026-03-07 23:38:09 +00:00
cybermaggedon
24bbe94136
Document chunks not stored in vector store (#665)
- Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes
- Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str]
  instead of chunks
- Translators - Updated to serialize/deserialize chunk_id
- Clients - DocumentEmbeddingsClient.query() returns chunk_ids
- SDK/API - flow.py, socket_client.py, bulk_client.py updated
- Document embeddings service - Stores chunk_id (document ID) instead
  of chunk text
- Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload
- Query services - Return chunk_id from vector store searches
- Gateway dispatchers - Serialize chunk_id in API responses
- Document RAG - Added librarian client to fetch chunk content from
  Garage using chunk_ids
- CLI tools - Updated all three tools:
  - invoke_document_embeddings.py - displays chunk_ids, removed
    max_chunk_length
  - save_doc_embeds.py - exports chunk_id
  - load_doc_embeds.py - imports chunk_id
2026-03-07 23:10:45 +00:00
cybermaggedon
be358efe67
Fix tests (#663) 2026-03-06 12:40:02 +00:00
cybermaggedon
2b9232917c
Fix/extraction prov (#662)
Quoted triple fixes, including...

1. Updated triple_provenance_triples() in triples.py:
   - Now accepts a Triple object directly
   - Creates the reification triple using TRIPLE term type: stmt_uri tg:reifies
         <<extracted_triple>>
   - Includes it in the returned provenance triples
    
2. Updated definitions extractor:
   - Added imports for provenance functions and component version
   - Added ParameterSpec for optional llm-model and ontology flow parameters
   - For each definition triple, generates provenance with reification
    
3. Updated relationships extractor:
   - Same changes as definitions extractor
2026-03-06 12:23:58 +00:00
cybermaggedon
cd5580be59
Extract-time provenance (#661)
1. Shared Provenance Module - URI generators, namespace constants,
   triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
   initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
   provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
   forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
   - Link entities to chunks via SUBJECT_OF (not top-level document)
   - Removed duplicate metadata emission (now handled by librarian)
   - Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
   - EntityContext schema has chunk_id field
   - EntityEmbeddings schema has chunk_id field
   - Definitions extractor sets chunk_id when creating EntityContext
   - Graph embeddings processor passes chunk_id through to
     EntityEmbeddings

Provenance Flow:
Document → Page (PDF) → Chunk → Extracted Facts/Embeddings
    ↓           ↓          ↓              ↓
  librarian  librarian  librarian    (chunk_id reference)
  + graph    + graph    + graph

Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.

Also, updating tests
2026-03-05 18:36:10 +00:00
cybermaggedon
d8f0a576af
Document API updates (#660)
* Doc streaming from librarian

* Fix chunk minimum confusion

* Add CLI args
2026-03-05 15:20:45 +00:00
cybermaggedon
a630e143ef
Incremental / large document loading (#659)
Tech spec

BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
  upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up

Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
  update_upload_session_chunk(), delete_upload_session(),
  list_upload_sessions()

- Schema extended with UploadSession, UploadProgress, and new
  request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
  abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
  - add_document() auto-switches to chunked for files > 10MB
  - Progress callback support (on_progress)
  - get_pending_uploads(), get_upload_status(), abort_upload(),
    resume_upload()

- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
  streaming retrieval
- Librarian operations:
  - add-child-document for extracted PDF pages
  - list-children to get child documents
  - stream-document for chunked content retrieval
  - Cascade delete removes children when parent is deleted
  - list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
  documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
  content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
  warnings directing users to tg-add-library-document +
  tg-start-library-processing

Remove load_pdf and load_text utils

Move chunker/librarian comms to base class

Updating tests
2026-03-04 16:57:58 +00:00
cybermaggedon
a38ca9474f
Tool services - dynamically pluggable tool implementations for agent frameworks (#658)
* New schema

* Tool service implementation

* Base class

* Joke service, for testing

* Update unit tests for tool services
2026-03-04 14:51:32 +00:00
cybermaggedon
0b83c08ae4
Use model in Azure LLM integration (#657) 2026-03-04 12:06:06 +00:00