trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-10 23:35:14 +02:00

Author	SHA1	Message	Date
cybermaggedon	286f762369	The id field in pipeline Metadata was being overwritten at each processing (#686 ) The id field in pipeline Metadata was being overwritten at each processing stage (document → page → chunk), causing knowledge storage to create separate cores per chunk instead of grouping by document. Add a root field that: - Is set by librarian to the original document ID - Is copied unchanged through PDF decoder, chunkers, and extractors - Is used by knowledge storage for document_id grouping (with fallback to id) Changes: - Add root field to Metadata schema with empty string default - Set root=document.id in librarian when initiating document processing - Copy root through PDF decoder, recursive chunker, and all extractors - Update knowledge storage to use root (or id as fallback) for grouping - Add root handling to translators and gateway serialization - Update test mock Metadata class to include root parameter	2026-03-11 12:16:39 +00:00
cybermaggedon	aa4f5c6c00	Remove redundant metadata (#685 ) The metadata field (list of triples) in the pipeline Metadata class was redundant. Document metadata triples already flow directly from librarian to triple-store via emit_document_provenance() - they don't need to pass through the extraction pipeline. Additionally, chunker and PDF decoder were overwriting metadata to [] anyway, so any metadata passed through the pipeline was being discarded. Changes: - Remove metadata field from Metadata dataclass (schema/core/metadata.py) - Update all Metadata instantiations to remove metadata=[] parameter - Remove metadata handling from translators (document_loading, knowledge) - Remove metadata consumption from extractors (ontology, agent) - Update gateway serializers and import handlers - Update all unit, integration, and contract tests	2026-03-11 10:51:39 +00:00
cybermaggedon	e1bc4c04a4	Terminology Rename, and named-graphs for explainability (#682 ) Terminology Rename, and named-graphs for explainability data Changed terminology: - session -> question - retrieval -> exploration - selection -> focus - answer -> synthesis - uris.py: Renamed query_session_uri → question_uri, retrieval_uri → exploration_uri, selection_uri → focus_uri, answer_uri → synthesis_uri - triples.py: Renamed corresponding triple generation functions with updated labels ("GraphRAG question", "Exploration", "Focus", "Synthesis") - namespaces.py: Added named graph constants GRAPH_DEFAULT, GRAPH_SOURCE, GRAPH_RETRIEVAL - init.py: Updated exports - graph_rag.py: Updated to use new terminology - invoke_graph_rag.py: Updated CLI to display new stage names (Question, Exploration, Focus, Synthesis) Query-Time Explainability → Named Graph - triples.py: Added set_graph() helper function to set named graph on triples - graph_rag.py: All explainability triples now use GRAPH_RETRIEVAL named graph - rag.py: Explainability triples stored in user's collection (not separate collection) with named graph Extraction Provenance → Named Graph - relationships/extract.py: Provenance triples use GRAPH_SOURCE named graph - definitions/extract.py: Provenance triples use GRAPH_SOURCE named graph - chunker.py: Provenance triples use GRAPH_SOURCE named graph - pdf_decoder.py: Provenance triples use GRAPH_SOURCE named graph CLI Updates - show_graph.py: Added -g/--graph option to filter by named graph and --show-graph to display graph column Also: - Fix knowledge core schemas	2026-03-10 14:35:21 +00:00
cybermaggedon	57eda65674	Knowledge core processing updated for embeddings interface change (#681 ) Knowledge core fixed: - trustgraph-flow/trustgraph/tables/knowledge.py - v.vector, v.chunk_id - trustgraph-base/trustgraph/messaging/translators/document_loading.py - chunk.vector - trustgraph-base/trustgraph/messaging/translators/knowledge.py - entity.vector - trustgraph-flow/trustgraph/gateway/dispatch/serialize.py - entity.vector, chunk.vector Test fixtures fixed: - tests/unit/test_storage/conftest.py - All mock entities/chunks use vector - tests/unit/test_query/conftest.py - All mock requests use vector - tests/unit/test_query/test_doc_embeddings_pinecone_query.py - All mock messages use vector These changes align with commit `f2ae0e86` which changed the schema from vectors: list[list[float]] to vector: list[float].	2026-03-10 13:28:16 +00:00
cybermaggedon	84941ce645	Fix Cassandra schema and graph filter semantics (#680 ) Schema fix (dtype/lang clustering key): - Add dtype and lang to PRIMARY KEY in quads_by_entity table - Add otype, dtype, lang to PRIMARY KEY in quads_by_collection table - Fixes deduplication bug where literals with same value but different datatype or language tag were collapsed (e.g., "thing" vs "thing"@en) - Update delete_collection to pass new clustering columns - Update tech spec to reflect new schema Graph filter semantics (simplified, no wildcard constant): - g=None means all graphs (no filter) - g="" means default graph only - g="uri" means specific named graph - Remove GRAPH_WILDCARD usage from EntityCentricKnowledgeGraph - Fix service.py streaming and non-streaming paths - Fix CLI to preserve empty string for -g '' argument	2026-03-10 12:52:51 +00:00
cybermaggedon	7a6197d8c3	GraphRAG Query-Time Explainability (#677 ) Implements full explainability pipeline for GraphRAG queries, enabling traceability from answers back to source documents. Renamed throughout for clarity: - provenance_callback → explain_callback - provenance_id → explain_id - provenance_collection → explain_collection - message_type "provenance" → "explain" - Queue name "provenance" → "explainability" GraphRAG queries now emit explainability events as they execute: 1. Session - query text and timestamp 2. Retrieval - edges retrieved from subgraph 3. Selection - selected edges with LLM reasoning (JSONL with id + reasoning) 4. Answer - reference to synthesized response Events stream via explain_callback during query(), enabling real-time UX. - Answers stored in librarian service (not inline in graph - too large) - Document ID as URN: urn:trustgraph:answer:{session_id} - Graph stores tg:document reference (IRI) to librarian document - Added librarian producer/consumer to graph-rag service - get_labelgraph() now returns (labeled_edges, uri_map) - uri_map maps edge_id(label_s, label_p, label_o) → (uri_s, uri_p, uri_o) - Explainability data stores original URIs, not labels - Enables tracing edges back to reifying statements via tg:reifies - Added serialize_triple() to query service (matches storage format) - get_term_value() now handles TRIPLE type terms - Enables querying by quoted triple in object position: ?stmt tg:reifies <<s p o>> - Displays real-time explainability events during query - Resolves rdfs:label for edge components (s, p, o) - Traces source chain via prov:wasDerivedFrom to root document - Output: "Source: Chunk 1 → Page 2 → Document Title" - Label caching to avoid repeated queries GraphRagResponse: - explain_id: str \| None - explain_collection: str \| None - message_type: str ("chunk" or "explain") - end_of_session: bool trustgraph-base/trustgraph/provenance/: - namespaces.py - Added TG_DOCUMENT predicate - triples.py - 
answer_triples() supports document_id reference - uris.py - Added edge_selection_uri() trustgraph-base/trustgraph/schema/services/retrieval.py: - GraphRagResponse with explain_id, explain_collection, end_of_session trustgraph-flow/trustgraph/retrieval/graph_rag/: - graph_rag.py - URI preservation, streaming answer accumulation - rag.py - Librarian integration, real-time explain emission trustgraph-flow/trustgraph/query/triples/cassandra/service.py: - Quoted triple serialization for query matching trustgraph-cli/trustgraph/cli/invoke_graph_rag.py: - Full explainability display with label resolution and source tracing	2026-03-10 10:00:01 +00:00
cybermaggedon	d2d71f859d	Feature/streaming triples (#676 ) * Streaming triples * Also GraphRAG service uses this * Updated tests	2026-03-09 15:46:33 +00:00
cybermaggedon	f2ae0e8623	Embeddings API scores (#671 ) - Put scores in all responses - Remove unused 'middle' vector layer. Vector of texts -> vector of (vector embedding)	2026-03-09 10:53:44 +00:00
cybermaggedon	4fa7cc7d7c	Fix/embeddings integration 2 (#670 )	2026-03-08 19:42:26 +00:00
cybermaggedon	0a2ce47a88	Batch embeddings (#668 ) Base Service (trustgraph-base/trustgraph/base/embeddings_service.py): - Changed on_request to use request.texts FastEmbed Processor (trustgraph-flow/trustgraph/embeddings/fastembed/processor.py): - on_embeddings(texts, model=None) now processes full batch efficiently - Returns [[v.tolist()] for v in vecs] - list of vector sets Ollama Processor (trustgraph-flow/trustgraph/embeddings/ollama/processor.py): - on_embeddings(texts, model=None) passes list directly to Ollama - Returns [[embedding] for embedding in embeds.embeddings] EmbeddingsClient (trustgraph-base/trustgraph/base/embeddings_client.py): - embed(texts, timeout=300) accepts list of texts Tests Updated: - test_fastembed_dynamic_model.py - 4 tests updated for new interface - test_ollama_dynamic_model.py - 4 tests updated for new interface Updated CLI, SDK and APIs	2026-03-08 18:36:54 +00:00
cybermaggedon	3bf8a65409	Fix tests (#666 )	2026-03-07 23:38:09 +00:00
cybermaggedon	be358efe67	Fix tests (#663 )	2026-03-06 12:40:02 +00:00
cybermaggedon	cd5580be59	Extract-time provenance (#661 ) 1. Shared Provenance Module - URI generators, namespace constants, triple builders, vocabulary bootstrap 2. Librarian - Emits document metadata to graph on processing initiation (vocabulary bootstrap + PROV-O triples) 3. PDF Extractor - Saves pages as child documents, emits parent-child provenance edges, forwards page IDs 4. Chunker - Saves chunks as child documents, emits provenance edges, forwards chunk ID + content 5. Knowledge Extractors (both definitions and relationships): - Link entities to chunks via SUBJECT_OF (not top-level document) - Removed duplicate metadata emission (now handled by librarian) - Get chunk_doc_id and chunk_uri from incoming Chunk message 6. Embedding Provenance: - EntityContext schema has chunk_id field - EntityEmbeddings schema has chunk_id field - Definitions extractor sets chunk_id when creating EntityContext - Graph embeddings processor passes chunk_id through to EntityEmbeddings Provenance Flow: Document → Page (PDF) → Chunk → Extracted Facts/Embeddings ↓ ↓ ↓ ↓ librarian librarian librarian (chunk_id reference) + graph + graph + graph Each artifact is stored in librarian with parent-child linking, and PROV-O edges are emitted to the knowledge graph for full traceability from any extracted fact back to its source document. Also, updating tests	2026-03-05 18:36:10 +00:00
cybermaggedon	a630e143ef	Incremental / large document loading (#659 ) Tech spec BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py): - get_stream() - yields document content in chunks for streaming retrieval - create_multipart_upload() - initializes S3 multipart upload, returns upload_id - upload_part() - uploads a single part, returns etag - complete_multipart_upload() - finalizes upload with part etags - abort_multipart_upload() - cancels and cleans up Cassandra schema (trustgraph-flow/trustgraph/tables/library.py): - New upload_session table with 24-hour TTL - Index on user for listing sessions - Prepared statements for all operations - Methods: create_upload_session(), get_upload_session(), update_upload_session_chunk(), delete_upload_session(), list_upload_sessions() - Schema extended with UploadSession, UploadProgress, and new request/response fields - Librarian methods: begin_upload, upload_chunk, complete_upload, abort_upload, get_upload_status, list_uploads - Service routing for all new operations - Python SDK with transparent chunked upload: - add_document() auto-switches to chunked for files > 10MB - Progress callback support (on_progress) - get_pending_uploads(), get_upload_status(), abort_upload(), resume_upload() - Document table: Added parent_id and document_type columns with index - Document schema (knowledge/document.py): Added document_id field for streaming retrieval - Librarian operations: - add-child-document for extracted PDF pages - list-children to get child documents - stream-document for chunked content retrieval - Cascade delete removes children when parent is deleted - list-documents filters children by default - PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large documents from librarian API to temp file - Librarian service (librarian/service.py): Sends document_id instead of content for large PDFs (>2MB) - Deprecated tools (load_pdf.py, load_text.py): Added deprecation warnings directing users to 
tg-add-library-document + tg-start-library-processing Remove load_pdf and load_text utils Move chunker/librarian comms to base class Updating tests	2026-03-04 16:57:58 +00:00
cybermaggedon	a38ca9474f	Tool services - dynamically pluggable tool implementations for agent frameworks (#658 ) * New schema * Tool service implementation * Base class * Joke service, for testing * Update unit tests for tool services	2026-03-04 14:51:32 +00:00
cybermaggedon	7d2d59a80f	Fix/tests (#647 )	2026-02-23 22:01:47 +00:00
cybermaggedon	1809c1f56d	Structured data 2 (#645 ) * Structured data refactor - multi-index tables, remove need for manual mods to the Cassandra tables * Tech spec updated to track implementation	2026-02-23 15:56:29 +00:00
cybermaggedon	5ffad92345	Fix subscriber unexpected message causing queue clogging (#642 )	2026-02-23 14:34:05 +00:00
cybermaggedon	00c1ca681b	Entity-centric graph (#633 ) * Tech spec for new entity-centric graph schema * Graph implementation	2026-02-16 13:26:43 +00:00
cybermaggedon	f24f1ebd80	Migrate VertexAI to google-genai SDK from deprecated library (#632 ) * Migrate VertexAI to google-genai SDK from deprecated library * Fix tests, mock the correct API	2026-02-09 20:43:33 +00:00
cybermaggedon	cf0daedefa	Changed schema for Value -> Term, majorly breaking change (#622 ) * Changed schema for Value -> Term, majorly breaking change * Following the schema change, Value -> Term into all processing * Updated Cassandra for g, p, s, o index patterns (7 indexes) * Reviewed and updated all tests * Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down	2026-01-27 13:48:08 +00:00
cybermaggedon	e214eb4e02	Feature/prompts jsonl (#619 ) * Tech spec * JSONL implementation complete * Updated prompt client users * Fix tests	2026-01-26 17:38:00 +00:00
cybermaggedon	11f41b07ab	Get neo4j to use limit (#618 ) * Get neo4j to use limit * Fix tests - they were exact matching on query strings	2026-01-22 15:16:34 +00:00
cybermaggedon	62b754d788	Fix flow loading (#611 )	2026-01-14 16:23:15 +00:00
cybermaggedon	b08db761d7	Fix config inconsistency (#609 ) * Plural/singular confusion in config key * Flow class vs flow blueprint nomenclature change * Update docs & CLI to reflect the above	2026-01-14 12:31:40 +00:00
cybermaggedon	807f6cc4e2	Fix non streaming RAG problems (#607 ) * Fix non-streaming failure in RAG services * Fix non-streaming failure in API * Fix agent non-streaming messaging * Agent messaging unit & contract tests	2026-01-12 18:45:52 +00:00
cybermaggedon	53cf5fd7f9	Fix test async warnings (#601 ) * Fix tracemalloc async warnings * Comment out debug, left in for use if needed	2026-01-06 22:09:34 +00:00
cybermaggedon	f79d0603f7	Update to add streaming tests (#600 )	2026-01-06 21:48:05 +00:00
cybermaggedon	ae13190093	Address legacy issues in storage management (#595 ) * Removed legacy storage management cruft. Tidied tech specs. * Fix deletion of last collection * Storage processor ignores data on the queue which is for a deleted collection * Updated tests	2026-01-05 13:45:14 +00:00
cybermaggedon	5304f96fe6	Fix tests (#593 ) * Fix unit/integration/contract tests which were broken by messaging fabric work	2025-12-19 08:53:21 +00:00
cybermaggedon	ba95fa226b	Gateway queue overrides (#584 )	2025-12-06 11:01:20 +00:00
cybermaggedon	7d07f802a8	Basic multitenant support (#583 ) * Tech spec * Address multi-tenant queue option problems in CLI * Modified collection service to use config * Changed storage management to use the config service definition	2025-12-05 21:45:30 +00:00
cybermaggedon	789d9713a0	Fix API tests (#581 ) * Fix RAG streaming CLIs * Fixed, all tests pass	2025-12-04 21:11:56 +00:00
cybermaggedon	01aeede78b	Python API implements streaming interfaces (#577 ) * Tech spec * Python CLI utilities updated to use the API including streaming features * Added type safety to Python API * Completed missing auth token support in CLI	2025-12-04 17:38:57 +00:00
cybermaggedon	310a2deb06	Feature/streaming llm phase 1 (#566 ) * Tidy up duplicate tech specs in doc directory * Streaming LLM text-completion service tech spec. * text-completion and prompt interfaces * streaming change applied to all LLMs, so far tested with VertexAI * Skip Pinecone unit tests, upstream module issue is affecting things, tests are passing again * Added agent streaming, not working and has broken tests	2025-11-26 09:59:10 +00:00
cybermaggedon	3580e7a7ae	Remove some 'unnecessary' parameters from OpenAI invocation (#561 ) * Remove some 'unnecessary' parameters from OpenAI invocation. The OpenAI API is getting complicated with the API and SDK changing on OpenAI's end, but this not getting mapped through to other services which are 'compatible' with OpenAI. * Update OpenAI test for this change * Trying running tests with Python 3.13	2025-11-20 17:56:31 +00:00
cybermaggedon	6c85038c75	Ontology extraction tests (#560 )	2025-11-13 20:02:12 +00:00
cybermaggedon	4c3db4dbbe	MCP auth for the simple case (#557 ) * MCP auth token header * Mention limitations * Fix AgentStep schema error by converting argument values to strings. * Added tests for MCP auth and agent step parsing	2025-11-11 12:28:53 +00:00
cybermaggedon	d9d4c91363	Dynamic embeddings model (#556 ) * Dynamic embeddings model selection * Added tests * HF embeddings are skipped, tests don't run with that package currently	2025-11-10 20:38:01 +00:00
cybermaggedon	6129bb68c1	Fix hard coded vector size (#555 ) * Fixed hard-coded embeddings store size * Vector store lazy-creates collections, different collections for different dimension lengths. * Added tech spec for vector store lifecycle * Fixed some tests for the new spec	2025-11-10 16:56:51 +00:00
cybermaggedon	51107008fd	master -> 1.5 (README updates) (#552 )	2025-10-11 11:46:03 +01:00
cybermaggedon	52b133fc86	Collection delete pt. 3 (#542 ) * Fixing collection deletion * Fixing collection management param error * Always test for collections * Add Cassandra collection table * Updated tech spec for explicit creation/deletion * Remove implicit collection creation * Fix up collection tracking in all processors	2025-09-30 16:02:33 +01:00
cybermaggedon	8929a680a1	Chunking dynamic params (#536 ) * Chunking params are dynamic * Update tests	2025-09-26 10:53:32 +01:00
cybermaggedon	43cfcb18a0	More LLM param test coverage (#535 ) * More LLM tests * Fixing tests	2025-09-26 01:00:30 +01:00
cybermaggedon	b0a3716b0e	Tests are failing (#534 ) * Fix tests, update to new model parameter usage	2025-09-25 21:32:19 +01:00
cybermaggedon	45a14b5958	Graph rag optimisations (#527 ) * Tech spec for GraphRAG optimisation * Implement GraphRAG optimisation and update tests	2025-09-23 21:05:51 +01:00
cybermaggedon	fcd15d1833	Collection management part 2 (#522 ) * Plumb collection manager into librarian * Test end-to-end	2025-09-19 16:08:47 +01:00
cybermaggedon	d378db9370	Cassandra performance enhancement (#521 ) * Tech spec * Tech spec complete * Cassandra multi-table for performance	2025-09-18 19:52:05 +01:00
cybermaggedon	13ff7d765d	Collection management (#520 ) * Tech spec * Refactored Cassandra knowledge graph for single table * Collection management, librarian services to manage metadata and collection deletion	2025-09-18 15:57:52 +01:00
cybermaggedon	48016d8fb2	Added XML, JSON, CSV detection (#519 ) * Improved XML detect, added schema selection * Add schema select + tests * API additions * More tests * Fixed tests	2025-09-16 23:53:43 +01:00
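
Note on 84941ce645 (graph filter semantics): the three-way rule in that commit message (g=None means all graphs, g="" means the default graph only, g="uri" means one named graph) can be sketched roughly as below. This is an illustration under assumptions, not the project's code: `Quad` and `filter_by_graph` are hypothetical names, and the URIs are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Quad:
    """Hypothetical quad record; not TrustGraph's actual type."""
    s: str
    p: str
    o: str
    g: str = ""  # empty string stands for the default graph


def filter_by_graph(quads: Iterable[Quad], g: Optional[str]) -> Iterator[Quad]:
    """Apply the three-way graph filter described in the commit message."""
    for q in quads:
        # None disables filtering; any string (including "") must match exactly.
        if g is None or q.g == g:
            yield q


# Made-up sample data: one quad in the default graph, one in a named graph.
quads = [
    Quad("urn:a", "urn:p", "urn:b"),
    Quad("urn:a", "urn:p", "urn:c", g="urn:graph:source"),
]

all_graphs = list(filter_by_graph(quads, None))               # no filter
default_only = list(filter_by_graph(quads, ""))               # default graph
named_only = list(filter_by_graph(quads, "urn:graph:source")) # one named graph
```

The comparison uses `g is None` rather than a truthiness test such as `if not g`, because a truthiness test would treat `None` and `""` alike. That is presumably why the same commit also fixes the CLI to preserve an empty string for `-g ''`.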


84 commits
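
Note on 286f762369 (root field in Metadata): the grouping rule described there, where knowledge storage groups by `root` and falls back to `id`, might look something like the following. This is a minimal sketch under assumptions: the `Metadata` class here is a cut-down stand-in with only the two fields the commit message names, `grouping_id` is a hypothetical helper, and the ID strings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    """Hypothetical stand-in for the pipeline Metadata schema."""
    id: str
    root: str = ""  # empty string default, as the commit message describes


def grouping_id(md: Metadata) -> str:
    """Return the document ID to group by: root if set, otherwise id."""
    return md.root or md.id


# id changes at each stage (document -> page -> chunk); root is copied unchanged.
doc = Metadata(id="doc-1", root="doc-1")
page = Metadata(id="doc-1/p1", root=doc.root)
chunk = Metadata(id="doc-1/p1/c3", root=page.root)

# A message produced before the root field existed carries only an id.
legacy = Metadata(id="old-chunk")

print(grouping_id(chunk))
print(grouping_id(legacy))
```

Because `root` defaults to an empty string, `md.root or md.id` covers both the new messages and any that predate the field.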