trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-09 14:55:13 +02:00

Author	SHA1	Message	Date
Cyber MacGeddon	f02bbdb442	New CLA workflow: Uses a github action in trustgraph-ai/contributor-license-agreement This blocks a PR until the commiter responds with a message of agreement with the CLA terms.	2026-03-26 14:09:07 +00:00
cybermaggedon	4164ef1c47	Add GATEWAY_SECRET support for MCP server to API gateway auth (#721 ) Pass bearer token from GATEWAY_SECRET environment variable as a URL query parameter on websocket connections to the API gateway. When unset or empty, no auth is applied (backwards compatible).	2026-03-26 10:49:28 +00:00
cybermaggedon	97f5645ea0	CLA (#716 ) Explanatory text for the CLA process	2026-03-26 09:08:09 +00:00
cybermaggedon	1f67fc2312	master -> release/v2.2 (#713 ) Merge doc updates from master into release branch	2026-03-25 17:53:20 +00:00
cybermaggedon	9330730afb	Add chunk content ID to explain trace provenance output (#708 ) When --show-provenance is used with tg-show-explain-trace, display the chunk URI on a Content: line below each Source: chain. This allows the user to easily fetch the source text with tg-get-document-content.	2026-03-23 16:20:52 +00:00
cybermaggedon	25995d03f4	Fix stray log messages caused by librarian messages (#706 ) Warning generated by librarian responses meant for other services (chunker, embeddings, etc.) arriving on the shared response queue. The decoder's subscription picks them up, can't match them to a pending request, and logs a warning. Removed the warnings, as not serving a purpose.	2026-03-23 13:16:39 +00:00
cybermaggedon	5c6fe90fe2	Add universal document decoder with multi-format support (#705 ) Add universal document decoder with multi-format support using 'unstructured'. New universal decoder service powered by the unstructured library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more through a single service. Tables are preserved as HTML markup for better downstream extraction. Images are stored in the librarian but excluded from the text pipeline. Configurable section grouping strategies (whole-document, heading, element-type, count, size) for non-page formats. Page-based formats (PDF, PPTX, XLSX) are automatically grouped by page. All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share the "document-decoder" ident so they are interchangeable. PDF-only decoders fetch document metadata to check MIME type and gracefully skip unsupported formats. Librarian changes: removed MIME type whitelist validation so any document format can be ingested. Simplified routing so text/plain goes to text-load and everything else goes to document-load. Removed dual inline/streaming data paths — documents always use document_id for content retrieval. New provenance entity types (tg:Section, tg:Image) and metadata predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer explainability. Universal decoder is in its own package (trustgraph-unstructured) and container image (trustgraph-unstructured).	2026-03-23 12:56:35 +00:00
cybermaggedon	4609424afe	Prepare 2.2 release branch (#704 )	2026-03-22 15:23:23 +00:00
cybermaggedon	96fd1eab15	Use UUID-based URNs for page and chunk IDs (#703 ) Page and chunk document IDs were deterministic ({doc_id}/p{num}, {doc_id}/p{num}/c{num}), causing "Document already exists" errors when reprocessing documents through different flows. Content may differ between runs due to different parameters or extractors, so deterministic IDs are incorrect. Pages now use urn:page:{uuid}, chunks use urn:chunk:{uuid}. Parent- child relationships are tracked via librarian metadata and provenance triples. Also brings Mistral OCR and Tesseract OCR decoders up to parity with the PDF decoder: librarian fetch/save support, per-page output with unique IDs, and provenance triple emission. Fixes Mistral OCR bug where only the first 5 pages were processed.	2026-03-21 21:17:03 +00:00
cybermaggedon	1a7b654bd3	Add semantic pre-filter for GraphRAG edge scoring (#702 ) Embed edge descriptions and compute cosine similarity against grounding concepts to reduce the number of edges sent to expensive LLM scoring. Controlled by edge_score_limit parameter (default 30), skipped when edge count is already below the limit. Also plumbs edge_score_limit and edge_limit parameters end-to-end: - CLI args (--edge-score-limit, --edge-limit) in both invoke and service - Socket client: fix parameter mapping to use hyphenated wire-format keys - Flow API, message translator, gateway all pass through correctly - Explainable code path (_question_explainable_api) now forwards all params - Default edge_score_limit changed from 50 to 30 based on typical subgraph sizes	2026-03-21 20:06:29 +00:00
cybermaggedon	bc68738c37	README.md from master (#701 )	2026-03-17 20:54:04 +00:00
cybermaggedon	664d1d0384	Update API specs for 2.1 (#699 ) * Updating API specs for 2.1 * Updated API and SDK docs	2026-03-17 20:36:31 +00:00
cybermaggedon	c387670944	Fix incorrect property names in explainability (#698 ) Remove type suffixes from explainability dataclass fields + fix show_explain_trace Rename dataclass fields to match KG property naming conventions: - Analysis: thought_uri/observation_uri → thought/observation - Synthesis/Conclusion/Reflection: document_uri → document Fix show_explain_trace for current API: - Resolve document content via librarian fetch instead of removed inline content fields (synthesis.content, conclusion.answer) - Add Grounding display for DocRAG traces - Update fetch_docrag_trace chain: Question → Grounding → Exploration → Synthesis - Pass api/explain_client to all print functions for content resolution Update all CLI tools and tests for renamed fields.	2026-03-16 14:47:37 +00:00
cybermaggedon	a115ec06ab	Enhance retrieval pipelines: 4-stage GraphRAG, DocRAG grounding (#697 ) Enhance retrieval pipelines: 4-stage GraphRAG, DocRAG grounding, consistent PROV-O GraphRAG: - Split retrieval into 4 prompt stages: extract-concepts, kg-edge-scoring, kg-edge-reasoning, kg-synthesis (was single-stage) - Add concept extraction (grounding) for per-concept embedding - Filter main query to default graph, ignoring provenance/explainability edges - Add source document edges to knowledge graph DocumentRAG: - Add grounding step with concept extraction, matching GraphRAG's pattern: Question → Grounding → Exploration → Synthesis - Per-concept embedding and chunk retrieval with deduplication Cross-pipeline: - Make PROV-O derivation links consistent: wasGeneratedBy for first entity from Activity, wasDerivedFrom for entity-to-entity chains - Update CLIs (tg-invoke-agent, tg-invoke-graph-rag, tg-invoke-document-rag) for new explainability structure - Fix all affected unit and integration tests	2026-03-16 12:12:13 +00:00
cybermaggedon	29b4300808	Updated test suite for explainability & provenance (#696 ) * Provenance tests * Embeddings tests * Test librarian * Test triples stream * Test concurrency * Entity centric graph writes * Agent tool service tests * Structured data tests * RDF tests * Addition LLM tests * Reliability tests	2026-03-13 14:27:42 +00:00
cybermaggedon	e6623fc915	Remove schema:subjectOf edges from KG extraction (#695 ) The subjectOf triples were redundant with the subgraph provenance model introduced in `e8407b34`. Entity-to-source lineage can be traced via tg:contains -> subgraph -> prov:wasDerivedFrom -> chunk, making the direct subjectOf edges unnecessary metadata polluting the knowledge graph. Removed from all three extractors (agent, definitions, relationships), cleaned up the SUBJECT_OF constant and vocabulary label, and updated tests accordingly.	2026-03-13 12:11:21 +00:00
cybermaggedon	64e3f6bd0d	Subgraph provenance (#694 ) Replace per-triple provenance reification with subgraph model Extraction provenance previously created a full reification (statement URI, activity, agent) for every single extracted triple, producing ~13 provenance triples per knowledge triple. Since each chunk is processed by a single LLM call, this was both redundant and semantically inaccurate. Now one subgraph object is created per chunk extraction, with tg:contains linking to each extracted triple. For 20 extractions from a chunk this reduces provenance from ~260 triples to ~33. - Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri - Replace triple_provenance_triples() with subgraph_provenance_triples() - Refactor kg-extract-definitions and kg-extract-relationships to generate provenance once per chunk instead of per triple - Add subgraph provenance to kg-extract-ontology and kg-extract-agent (previously had none) - Update CLI tools and tech specs to match Also rename tg-show-document-hierarchy to tg-show-extraction-provenance. Added extra typing for extraction provenance, fixed extraction prov CLI	2026-03-13 11:37:59 +00:00
cybermaggedon	35128ff019	Add unified explainability support and librarian storage for (#693 ) Add unified explainability support and librarian storage for all retrieval engines Implements consistent explainability/provenance tracking across GraphRAG, DocumentRAG, and Agent retrieval engines. All large content (answers, thoughts, observations) is now stored in librarian rather than as inline literals in the knowledge graph. Explainability API: - New explainability.py module with entity classes (Question, Exploration, Focus, Synthesis, Analysis, Conclusion) and ExplainabilityClient - Quiescence-based eventual consistency handling for trace fetching - Content fetching from librarian with retry logic CLI updates: - tg-invoke-graph-rag -x/--explainable flag returns explain_id - tg-invoke-document-rag -x/--explainable flag returns explain_id - tg-invoke-agent -x/--explainable flag returns explain_id - tg-list-explain-traces uses new explainability API - tg-show-explain-trace handles all three trace types Agent provenance: - Records session, iterations (think/act/observe), and conclusion - Stores thoughts and observations in librarian with document references - New predicates: tg:thoughtDocument, tg:observationDocument DocumentRAG provenance: - Records question, exploration (chunk retrieval), and synthesis - Stores answers in librarian with document references Schema changes: - AgentResponse: added explain_id, explain_graph fields - RetrievalResponse: added explain_id, explain_graph fields - agent_iteration_triples: supports thought_document_id, observation_document_id Update tests.	2026-03-12 21:40:09 +00:00
cybermaggedon	aecf00f040	Minor agent tweaks (#692 ) Update RAG and Agent clients for streaming message handling GraphRAG now sends multiple message types in a stream: - 'explain' messages with explain_id and explain_graph for provenance - 'chunk' messages with response text fragments - end_of_session marker for stream completion Updated all clients to handle this properly: CLI clients (trustgraph-base/trustgraph/clients/): - graph_rag_client.py: Added chunk_callback and explain_callback - document_rag_client.py: Added chunk_callback and explain_callback - agent_client.py: Added think, observe, answer_callback, error_callback Internal clients (trustgraph-base/trustgraph/base/): - graph_rag_client.py: Async callbacks for streaming - agent_client.py: Async callbacks for streaming All clients now: - Route messages by chunk_type/message_type - Stream via optional callbacks for incremental delivery - Wait for proper completion signals (end_of_dialog/end_of_session/end_of_stream) - Accumulate and return complete response for callers not using callbacks Updated callers: - extract/kg/agent/extract.py: Uses new invoke(question=...) API - tests/integration/test_agent_kg_extraction_integration.py: Updated mocks This fixes the agent infinite loop issue where knowledge_query was returning the first 'explain' message (empty response) instead of waiting for the actual answer chunks. Concurrency in triples query	2026-03-12 17:59:02 +00:00
cybermaggedon	45e6ad4abc	Fix ontology RAG pipeline + add query concurrency (#691 ) - Fix ontology RAG pipeline: embeddings API, chunker provenance, and query concurrency - Fix ontology embeddings to use correct response shape from embed() API (returns list of vectors, not list of list of vectors). - Simplify chunker URI logic to append /c{index} to parent ID instead of parsing page/doc URI structure which was fragile. - Add provenance tracking and librarian integration to token chunker, matching recursive chunker capabilities. - Add configurable concurrency (default 10) to Cassandra, Qdrant, and embeddings query services.	2026-03-12 11:34:42 +00:00
cybermaggedon	312174eb88	Adding explainability to the ReACT agent (#689 ) * Added tech spec * Add provenance recording to React agent loop Enables agent sessions to be traced and debugged using the same explainability infrastructure as GraphRAG. Agent traces record: - Session start with query and timestamp - Each iteration's thought, action, arguments, and observation - Final answer with derivation chain Changes: - Add session_id and collection fields to AgentRequest schema - Add agent predicates (TG_THOUGHT, TG_ACTION, etc.) to namespaces - Create agent provenance triple generators in provenance/agent.py - Register explainability producer in agent service - Emit provenance triples during agent execution - Update CLI tools to detect and render agent traces alongside GraphRAG * Updated explainability taxonomy: GraphRAG: tg:Question → tg:Exploration → tg:Focus → tg:Synthesis Agent: tg:Question → tg:Analysis(s) → tg:Conclusion All entities also have their PROV-O type (prov:Activity or prov:Entity). Updated commit message: Add provenance recording to React agent loop Enables agent sessions to be traced and debugged using the same explainability infrastructure as GraphRAG. Entity types follow human reasoning patterns: - tg:Question - the user's query (shared with GraphRAG) - tg:Analysis - each think/act/observe cycle - tg:Conclusion - the final answer Also adds explicit TG types to GraphRAG entities: - tg:Question, tg:Exploration, tg:Focus, tg:Synthesis All types retain their PROV-O base types (prov:Activity, prov:Entity). Changes: - Add session_id and collection fields to AgentRequest schema - Add explainability entity types to namespaces.py - Create agent provenance triple generators - Register explainability producer in agent service - Emit provenance triples during agent execution - Update CLI tools to detect and render both trace types * Document RAG explainability is now complete. Here's a summary of the changes made: Schema Changes: - trustgraph-base/trustgraph/schema/services/retrieval.py: Added explain_id and explain_graph fields to DocumentRagResponse - trustgraph-base/trustgraph/messaging/translators/retrieval.py: Updated translator to handle explainability fields Provenance Changes: - trustgraph-base/trustgraph/provenance/namespaces.py: Added TG_CHUNK_COUNT and TG_SELECTED_CHUNK predicates - trustgraph-base/trustgraph/provenance/uris.py: Added docrag_question_uri, docrag_exploration_uri, docrag_synthesis_uri generators - trustgraph-base/trustgraph/provenance/triples.py: Added docrag_question_triples, docrag_exploration_triples, docrag_synthesis_triples builders - trustgraph-base/trustgraph/provenance/__init__.py: Exported all new Document RAG functions and predicates Service Changes: - trustgraph-flow/trustgraph/retrieval/document_rag/document_rag.py: Added explainability callback support and triple emission at each phase (Question → Exploration → Synthesis) - trustgraph-flow/trustgraph/retrieval/document_rag/rag.py: Registered explainability producer and wired up the callback Documentation: - docs/tech-specs/agent-explainability.md: Added Document RAG entity types and provenance model documentation Document RAG Provenance Model: Question (urn:trustgraph:docrag:{uuid}) │ │ tg:query, prov:startedAtTime │ rdf:type = prov:Activity, tg:Question │ ↓ prov:wasGeneratedBy │ Exploration (urn:trustgraph:docrag:{uuid}/exploration) │ │ tg:chunkCount, tg:selectedChunk (multiple) │ rdf:type = prov:Entity, tg:Exploration │ ↓ prov:wasDerivedFrom │ Synthesis (urn:trustgraph:docrag:{uuid}/synthesis) │ │ tg:content = "The answer..." │ rdf:type = prov:Entity, tg:Synthesis * Specific subtype that makes the retrieval mechanism immediately obvious: System: GraphRAG TG Types on Question: tg:Question, tg:GraphRagQuestion URI Pattern: urn:trustgraph:question:{uuid} ──────────────────────────────────────── System: Document RAG TG Types on Question: tg:Question, tg:DocRagQuestion URI Pattern: urn:trustgraph:docrag:{uuid} ──────────────────────────────────────── System: Agent TG Types on Question: tg:Question, tg:AgentQuestion URI Pattern: urn:trustgraph:agent:{uuid} Files modified: - trustgraph-base/trustgraph/provenance/namespaces.py - Added TG_GRAPH_RAG_QUESTION, TG_DOC_RAG_QUESTION, TG_AGENT_QUESTION - trustgraph-base/trustgraph/provenance/triples.py - Added subtype to question_triples and docrag_question_triples - trustgraph-base/trustgraph/provenance/agent.py - Added subtype to agent_session_triples - trustgraph-base/trustgraph/provenance/__init__.py - Exported new types - docs/tech-specs/agent-explainability.md - Documented the subtypes This allows: - Query all questions: ?q rdf:type tg:Question - Query only GraphRAG: ?q rdf:type tg:GraphRagQuestion - Query only Document RAG: ?q rdf:type tg:DocRagQuestion - Query only Agent: ?q rdf:type tg:AgentQuestion * Fixed tests	2026-03-11 15:28:15 +00:00
cybermaggedon	a53ed41da2	Add explainability CLI tools (#688 ) Add explainability CLI tools for debugging provenance data - tg-show-document-hierarchy: Display document → page → chunk → edge hierarchy by traversing prov:wasDerivedFrom relationships - tg-list-explain-traces: List all GraphRAG sessions with questions and timestamps from the retrieval graph - tg-show-explain-trace: Show full explainability cascade for a GraphRAG session (question → exploration → focus → synthesis) These tools query the source and retrieval graphs to help debug and explore provenance/explainability data stored during document processing and GraphRAG queries.	2026-03-11 13:44:29 +00:00
cybermaggedon	fda508fdae	Lock mistralai to <2.0.0, avoids a breaking change (#687 )	2026-03-11 12:19:04 +00:00
cybermaggedon	286f762369	The id field in pipeline Metadata was being overwritten at each processing (#686 ) The id field in pipeline Metadata was being overwritten at each processing stage (document → page → chunk), causing knowledge storage to create separate cores per chunk instead of grouping by document. Add a root field that: - Is set by librarian to the original document ID - Is copied unchanged through PDF decoder, chunkers, and extractors - Is used by knowledge storage for document_id grouping (with fallback to id) Changes: - Add root field to Metadata schema with empty string default - Set root=document.id in librarian when initiating document processing - Copy root through PDF decoder, recursive chunker, and all extractors - Update knowledge storage to use root (or id as fallback) for grouping - Add root handling to translators and gateway serialization - Update test mock Metadata class to include root parameter	2026-03-11 12:16:39 +00:00
cybermaggedon	aa4f5c6c00	Remove redundant metadata (#685 ) The metadata field (list of triples) in the pipeline Metadata class was redundant. Document metadata triples already flow directly from librarian to triple-store via emit_document_provenance() - they don't need to pass through the extraction pipeline. Additionally, chunker and PDF decoder were overwriting metadata to [] anyway, so any metadata passed through the pipeline was being discarded. Changes: - Remove metadata field from Metadata dataclass (schema/core/metadata.py) - Update all Metadata instantiations to remove metadata=[] parameter - Remove metadata handling from translators (document_loading, knowledge) - Remove metadata consumption from extractors (ontology, agent) - Update gateway serializers and import handlers - Update all unit, integration, and contract tests	2026-03-11 10:51:39 +00:00
cybermaggedon	1837d73f34	Dataflow tech spec for extraction (#684 )	2026-03-11 10:49:50 +00:00
cybermaggedon	ca525b679f	Filter answers from document lists (#683 )	2026-03-10 17:22:19 +00:00
cybermaggedon	e1bc4c04a4	Terminology Rename, and named-graphs for explainability (#682 ) Terminology Rename, and named-graphs for explainability data Changed terminology: - session -> question - retrieval -> exploration - selection -> focus - answer -> synthesis - uris.py: Renamed query_session_uri → question_uri, retrieval_uri → exploration_uri, selection_uri → focus_uri, answer_uri → synthesis_uri - triples.py: Renamed corresponding triple generation functions with updated labels ("GraphRAG question", "Exploration", "Focus", "Synthesis") - namespaces.py: Added named graph constants GRAPH_DEFAULT, GRAPH_SOURCE, GRAPH_RETRIEVAL - init.py: Updated exports - graph_rag.py: Updated to use new terminology - invoke_graph_rag.py: Updated CLI to display new stage names (Question, Exploration, Focus, Synthesis) Query-Time Explainability → Named Graph - triples.py: Added set_graph() helper function to set named graph on triples - graph_rag.py: All explainability triples now use GRAPH_RETRIEVAL named graph - rag.py: Explainability triples stored in user's collection (not separate collection) with named graph Extraction Provenance → Named Graph - relationships/extract.py: Provenance triples use GRAPH_SOURCE named graph - definitions/extract.py: Provenance triples use GRAPH_SOURCE named graph - chunker.py: Provenance triples use GRAPH_SOURCE named graph - pdf_decoder.py: Provenance triples use GRAPH_SOURCE named graph CLI Updates - show_graph.py: Added -g/--graph option to filter by named graph and --show-graph to display graph column Also: - Fix knowledge core schemas	2026-03-10 14:35:21 +00:00
cybermaggedon	57eda65674	Knowledge core processing updated for embeddings interface change (#681 ) Knowledge core fixed: - trustgraph-flow/trustgraph/tables/knowledge.py - v.vector, v.chunk_id - trustgraph-base/trustgraph/messaging/translators/document_loading.py - chunk.vector - trustgraph-base/trustgraph/messaging/translators/knowledge.py - entity.vector - trustgraph-flow/trustgraph/gateway/dispatch/serialize.py - entity.vector, chunk.vector Test fixtures fixed: - tests/unit/test_storage/conftest.py - All mock entities/chunks use vector - tests/unit/test_query/conftest.py - All mock requests use vector - tests/unit/test_query/test_doc_embeddings_pinecone_query.py - All mock messages use vector These changes align with commit `f2ae0e86` which changed the schema from vectors: list[list[float]] to vector: list[float].	2026-03-10 13:28:16 +00:00
cybermaggedon	84941ce645	Fix Cassandra schema and graph filter semantics (#680 ) Schema fix (dtype/lang clustering key): - Add dtype and lang to PRIMARY KEY in quads_by_entity table - Add otype, dtype, lang to PRIMARY KEY in quads_by_collection table - Fixes deduplication bug where literals with same value but different datatype or language tag were collapsed (e.g., "thing" vs "thing"@en) - Update delete_collection to pass new clustering columns - Update tech spec to reflect new schema Graph filter semantics (simplified, no wildcard constant): - g=None means all graphs (no filter) - g="" means default graph only - g="uri" means specific named graph - Remove GRAPH_WILDCARD usage from EntityCentricKnowledgeGraph - Fix service.py streaming and non-streaming paths - Fix CLI to preserve empty string for -g '' argument	2026-03-10 12:52:51 +00:00
cybermaggedon	c951562189	Graph query CLI tool (#679 ) New CLI tool that enables selective queries against the triple store unlike tg-show-graph which dumps the entire graph. Features: - Filter by subject, predicate, object, and/or named graph - Auto-detection of term types (IRI, literal, quoted triple) - Two ways to specify quoted triples: - Inline Turtle-style: -o "<<s p o>>" - Explicit flags: --qt-subject, --qt-predicate, --qt-object - Output formats: space-separated, pipe-separated, JSON, JSON Lines - Streaming mode for efficient large result sets Auto-detection rules: - http://, https://, urn:, or <wrapped> -> IRI - <<s p o>> -> quoted triple - Otherwise -> literal	2026-03-10 11:03:34 +00:00
cybermaggedon	ec83775789	Update tech spec (#678 )	2026-03-10 10:07:37 +00:00
cybermaggedon	7a6197d8c3	GraphRAG Query-Time Explainability (#677 ) Implements full explainability pipeline for GraphRAG queries, enabling traceability from answers back to source documents. Renamed throughout for clarity: - provenance_callback → explain_callback - provenance_id → explain_id - provenance_collection → explain_collection - message_type "provenance" → "explain" - Queue name "provenance" → "explainability" GraphRAG queries now emit explainability events as they execute: 1. Session - query text and timestamp 2. Retrieval - edges retrieved from subgraph 3. Selection - selected edges with LLM reasoning (JSONL with id + reasoning) 4. Answer - reference to synthesized response Events stream via explain_callback during query(), enabling real-time UX. - Answers stored in librarian service (not inline in graph - too large) - Document ID as URN: urn:trustgraph:answer:{session_id} - Graph stores tg:document reference (IRI) to librarian document - Added librarian producer/consumer to graph-rag service - get_labelgraph() now returns (labeled_edges, uri_map) - uri_map maps edge_id(label_s, label_p, label_o) → (uri_s, uri_p, uri_o) - Explainability data stores original URIs, not labels - Enables tracing edges back to reifying statements via tg:reifies - Added serialize_triple() to query service (matches storage format) - get_term_value() now handles TRIPLE type terms - Enables querying by quoted triple in object position: ?stmt tg:reifies <<s p o>> - Displays real-time explainability events during query - Resolves rdfs:label for edge components (s, p, o) - Traces source chain via prov:wasDerivedFrom to root document - Output: "Source: Chunk 1 → Page 2 → Document Title" - Label caching to avoid repeated queries GraphRagResponse: - explain_id: str \| None - explain_collection: str \| None - message_type: str ("chunk" or "explain") - end_of_session: bool trustgraph-base/trustgraph/provenance/: - namespaces.py - Added TG_DOCUMENT predicate - triples.py - answer_triples() supports document_id reference - uris.py - Added edge_selection_uri() trustgraph-base/trustgraph/schema/services/retrieval.py: - GraphRagResponse with explain_id, explain_collection, end_of_session trustgraph-flow/trustgraph/retrieval/graph_rag/: - graph_rag.py - URI preservation, streaming answer accumulation - rag.py - Librarian integration, real-time explain emission trustgraph-flow/trustgraph/query/triples/cassandra/service.py: - Quoted triple serialization for query matching trustgraph-cli/trustgraph/cli/invoke_graph_rag.py: - Full explainability display with label resolution and source tracing	2026-03-10 10:00:01 +00:00
cybermaggedon	d2d71f859d	Feature/streaming triples (#676 ) * Steaming triples * Also GraphRAG service uses this * Updated tests	2026-03-09 15:46:33 +00:00
cybermaggedon	3c3e11bef5	Fix/librarian broken (#674 ) * Set end-of-stream cleanly - clean streaming message structures * Add tg-get-document-content	2026-03-09 13:36:24 +00:00
cybermaggedon	df1808768d	Fix/doc streaming proto (#673 ) * Librarian streaming doc download * Document stream download endpoint	2026-03-09 12:36:10 +00:00
cybermaggedon	b2ef7bbb8c	Fix doc embeddings invocation (#672 ) * Fix doc embeddings invocation * Tidy query embeddings invocation	2026-03-09 11:07:32 +00:00
cybermaggedon	f2ae0e8623	Embeddings API scores (#671 ) - Put scores in all responses - Remove unused 'middle' vector layer. Vector of texts -> vector of (vector embedding)	2026-03-09 10:53:44 +00:00
cybermaggedon	4fa7cc7d7c	Fix/embeddings integration 2 (#670 )	2026-03-08 19:42:26 +00:00
cybermaggedon	919b760c05	Update embeddings integration for new batch embeddings interfaces (#669 ) * Fix vector extraction * Fix embeddings integration	2026-03-08 19:41:52 +00:00
cybermaggedon	0a2ce47a88	Batch embeddings (#668 ) Base Service (trustgraph-base/trustgraph/base/embeddings_service.py): - Changed on_request to use request.texts FastEmbed Processor (trustgraph-flow/trustgraph/embeddings/fastembed/processor.py): - on_embeddings(texts, model=None) now processes full batch efficiently - Returns [[v.tolist()] for v in vecs] - list of vector sets Ollama Processor (trustgraph-flow/trustgraph/embeddings/ollama/processor.py): - on_embeddings(texts, model=None) passes list directly to Ollama - Returns [[embedding] for embedding in embeds.embeddings] EmbeddingsClient (trustgraph-base/trustgraph/base/embeddings_client.py): - embed(texts, timeout=300) accepts list of texts Tests Updated: - test_fastembed_dynamic_model.py - 4 tests updated for new interface - test_ollama_dynamic_model.py - 4 tests updated for new interface Updated CLI, SDK and APIs	2026-03-08 18:36:54 +00:00
cybermaggedon	3bf8a65409	Fix tests (#666 )	2026-03-07 23:38:09 +00:00
cybermaggedon	24bbe94136	Document chunks not stored in vector store (#665 ) - Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes - Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str] instead of chunks - Translators - Updated to serialize/deserialize chunk_id - Clients - DocumentEmbeddingsClient.query() returns chunk_ids - SDK/API - flow.py, socket_client.py, bulk_client.py updated - Document embeddings service - Stores chunk_id (document ID) instead of chunk text - Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload - Query services - Return chunk_id from vector store searches - Gateway dispatchers - Serialize chunk_id in API responses - Document RAG - Added librarian client to fetch chunk content from Garage using chunk_ids - CLI tools - Updated all three tools: - invoke_document_embeddings.py - displays chunk_ids, removed max_chunk_length - save_doc_embeds.py - exports chunk_id - load_doc_embeds.py - imports chunk_id	2026-03-07 23:10:45 +00:00
cybermaggedon	be358efe67	Fix tests (#663 )	2026-03-06 12:40:02 +00:00
cybermaggedon	2b9232917c	Fix/extraction prov (#662 ) Quoted triple fixes, including... 1. Updated triple_provenance_triples() in triples.py: - Now accepts a Triple object directly - Creates the reification triple using TRIPLE term type: stmt_uri tg:reifies <<extracted_triple>> - Includes it in the returned provenance triples 2. Updated definitions extractor: - Added imports for provenance functions and component version - Added ParameterSpec for optional llm-model and ontology flow parameters - For each definition triple, generates provenance with reification 3. Updated relationships extractor: - Same changes as definitions extractor	2026-03-06 12:23:58 +00:00
cybermaggedon	cd5580be59	Extract-time provenance (#661 ) 1. Shared Provenance Module - URI generators, namespace constants, triple builders, vocabulary bootstrap 2. Librarian - Emits document metadata to graph on processing initiation (vocabulary bootstrap + PROV-O triples) 3. PDF Extractor - Saves pages as child documents, emits parent-child provenance edges, forwards page IDs 4. Chunker - Saves chunks as child documents, emits provenance edges, forwards chunk ID + content 5. Knowledge Extractors (both definitions and relationships): - Link entities to chunks via SUBJECT_OF (not top-level document) - Removed duplicate metadata emission (now handled by librarian) - Get chunk_doc_id and chunk_uri from incoming Chunk message 6. Embedding Provenance: - EntityContext schema has chunk_id field - EntityEmbeddings schema has chunk_id field - Definitions extractor sets chunk_id when creating EntityContext - Graph embeddings processor passes chunk_id through to EntityEmbeddings Provenance Flow: Document → Page (PDF) → Chunk → Extracted Facts/Embeddings ↓ ↓ ↓ ↓ librarian librarian librarian (chunk_id reference) + graph + graph + graph Each artifact is stored in librarian with parent-child linking, and PROV-O edges are emitted to the knowledge graph for full traceability from any extracted fact back to its source document. Also, updating tests	2026-03-05 18:36:10 +00:00
cybermaggedon	d8f0a576af	Document API updates (#660 ) * Doc streaming from librarian * Fix chunk minimum confusion * Add CLI args	2026-03-05 15:20:45 +00:00
cybermaggedon	a630e143ef	Incremental / large document loading (#659 ) Tech spec BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py): - get_stream() - yields document content in chunks for streaming retrieval - create_multipart_upload() - initializes S3 multipart upload, returns upload_id - upload_part() - uploads a single part, returns etag - complete_multipart_upload() - finalizes upload with part etags - abort_multipart_upload() - cancels and cleans up Cassandra schema (trustgraph-flow/trustgraph/tables/library.py): - New upload_session table with 24-hour TTL - Index on user for listing sessions - Prepared statements for all operations - Methods: create_upload_session(), get_upload_session(), update_upload_session_chunk(), delete_upload_session(), list_upload_sessions() - Schema extended with UploadSession, UploadProgress, and new request/response fields - Librarian methods: begin_upload, upload_chunk, complete_upload, abort_upload, get_upload_status, list_uploads - Service routing for all new operations - Python SDK with transparent chunked upload: - add_document() auto-switches to chunked for files > 10MB - Progress callback support (on_progress) - get_pending_uploads(), get_upload_status(), abort_upload(), resume_upload() - Document table: Added parent_id and document_type columns with index - Document schema (knowledge/document.py): Added document_id field for streaming retrieval - Librarian operations: - add-child-document for extracted PDF pages - list-children to get child documents - stream-document for chunked content retrieval - Cascade delete removes children when parent is deleted - list-documents filters children by default - PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large documents from librarian API to temp file - Librarian service (librarian/service.py): Sends document_id instead of content for large PDFs (>2MB) - Deprecated tools (load_pdf.py, load_text.py): Added deprecation warnings directing users to tg-add-library-document + tg-start-library-processing Remove load_pdf and load_text utils Move chunker/librarian comms to base class Updating tests	2026-03-04 16:57:58 +00:00
cybermaggedon	a38ca9474f	Tool services - dynamically pluggable tool implementations for agent frameworks (#658 ) * New schema * Tool service implementation * Base class * Joke service, for testing * Update unit tests for tool services	2026-03-04 14:51:32 +00:00
cybermaggedon	0b83c08ae4	Use model in Azure LLM integration (#657 )	2026-03-04 12:06:06 +00:00

1 2 3 4 5 ...

1051 commits