trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-30 00:49:38 +02:00

Author	SHA1	Message	Date
cybermaggedon	c951562189	Graph query CLI tool (#679 ) New CLI tool that enables selective queries against the triple store unlike tg-show-graph which dumps the entire graph. Features: - Filter by subject, predicate, object, and/or named graph - Auto-detection of term types (IRI, literal, quoted triple) - Two ways to specify quoted triples: - Inline Turtle-style: -o "<<s p o>>" - Explicit flags: --qt-subject, --qt-predicate, --qt-object - Output formats: space-separated, pipe-separated, JSON, JSON Lines - Streaming mode for efficient large result sets Auto-detection rules: - http://, https://, urn:, or <wrapped> -> IRI - <<s p o>> -> quoted triple - Otherwise -> literal	2026-03-10 11:03:34 +00:00
cybermaggedon	ec83775789	Update tech spec (#678 )	2026-03-10 10:07:37 +00:00
cybermaggedon	7a6197d8c3	GraphRAG Query-Time Explainability (#677 ) Implements full explainability pipeline for GraphRAG queries, enabling traceability from answers back to source documents. Renamed throughout for clarity: - provenance_callback → explain_callback - provenance_id → explain_id - provenance_collection → explain_collection - message_type "provenance" → "explain" - Queue name "provenance" → "explainability" GraphRAG queries now emit explainability events as they execute: 1. Session - query text and timestamp 2. Retrieval - edges retrieved from subgraph 3. Selection - selected edges with LLM reasoning (JSONL with id + reasoning) 4. Answer - reference to synthesized response Events stream via explain_callback during query(), enabling real-time UX. - Answers stored in librarian service (not inline in graph - too large) - Document ID as URN: urn:trustgraph:answer:{session_id} - Graph stores tg:document reference (IRI) to librarian document - Added librarian producer/consumer to graph-rag service - get_labelgraph() now returns (labeled_edges, uri_map) - uri_map maps edge_id(label_s, label_p, label_o) → (uri_s, uri_p, uri_o) - Explainability data stores original URIs, not labels - Enables tracing edges back to reifying statements via tg:reifies - Added serialize_triple() to query service (matches storage format) - get_term_value() now handles TRIPLE type terms - Enables querying by quoted triple in object position: ?stmt tg:reifies <<s p o>> - Displays real-time explainability events during query - Resolves rdfs:label for edge components (s, p, o) - Traces source chain via prov:wasDerivedFrom to root document - Output: "Source: Chunk 1 → Page 2 → Document Title" - Label caching to avoid repeated queries GraphRagResponse: - explain_id: str | None - explain_collection: str | None - message_type: str ("chunk" or "explain") - end_of_session: bool trustgraph-base/trustgraph/provenance/: - namespaces.py - Added TG_DOCUMENT predicate - triples.py - answer_triples() supports document_id reference - uris.py - Added edge_selection_uri() trustgraph-base/trustgraph/schema/services/retrieval.py: - GraphRagResponse with explain_id, explain_collection, end_of_session trustgraph-flow/trustgraph/retrieval/graph_rag/: - graph_rag.py - URI preservation, streaming answer accumulation - rag.py - Librarian integration, real-time explain emission trustgraph-flow/trustgraph/query/triples/cassandra/service.py: - Quoted triple serialization for query matching trustgraph-cli/trustgraph/cli/invoke_graph_rag.py: - Full explainability display with label resolution and source tracing	2026-03-10 10:00:01 +00:00
cybermaggedon	d2d71f859d	Feature/streaming triples (#676 ) * Streaming triples * Also GraphRAG service uses this * Updated tests	2026-03-09 15:46:33 +00:00
cybermaggedon	3c3e11bef5	Fix/librarian broken (#674 ) * Set end-of-stream cleanly - clean streaming message structures * Add tg-get-document-content	2026-03-09 13:36:24 +00:00
cybermaggedon	df1808768d	Fix/doc streaming proto (#673 ) * Librarian streaming doc download * Document stream download endpoint	2026-03-09 12:36:10 +00:00
cybermaggedon	b2ef7bbb8c	Fix doc embeddings invocation (#672 ) * Fix doc embeddings invocation * Tidy query embeddings invocation	2026-03-09 11:07:32 +00:00
cybermaggedon	f2ae0e8623	Embeddings API scores (#671 ) - Put scores in all responses - Remove unused 'middle' vector layer. Vector of texts -> vector of (vector embedding)	2026-03-09 10:53:44 +00:00
cybermaggedon	4fa7cc7d7c	Fix/embeddings integration 2 (#670 )	2026-03-08 19:42:26 +00:00
cybermaggedon	919b760c05	Update embeddings integration for new batch embeddings interfaces (#669 ) * Fix vector extraction * Fix embeddings integration	2026-03-08 19:41:52 +00:00
cybermaggedon	0a2ce47a88	Batch embeddings (#668 ) Base Service (trustgraph-base/trustgraph/base/embeddings_service.py): - Changed on_request to use request.texts FastEmbed Processor (trustgraph-flow/trustgraph/embeddings/fastembed/processor.py): - on_embeddings(texts, model=None) now processes full batch efficiently - Returns [[v.tolist()] for v in vecs] - list of vector sets Ollama Processor (trustgraph-flow/trustgraph/embeddings/ollama/processor.py): - on_embeddings(texts, model=None) passes list directly to Ollama - Returns [[embedding] for embedding in embeds.embeddings] EmbeddingsClient (trustgraph-base/trustgraph/base/embeddings_client.py): - embed(texts, timeout=300) accepts list of texts Tests Updated: - test_fastembed_dynamic_model.py - 4 tests updated for new interface - test_ollama_dynamic_model.py - 4 tests updated for new interface Updated CLI, SDK and APIs	2026-03-08 18:36:54 +00:00
cybermaggedon	3bf8a65409	Fix tests (#666 )	2026-03-07 23:38:09 +00:00
cybermaggedon	24bbe94136	Document chunks not stored in vector store (#665 ) - Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes - Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str] instead of chunks - Translators - Updated to serialize/deserialize chunk_id - Clients - DocumentEmbeddingsClient.query() returns chunk_ids - SDK/API - flow.py, socket_client.py, bulk_client.py updated - Document embeddings service - Stores chunk_id (document ID) instead of chunk text - Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload - Query services - Return chunk_id from vector store searches - Gateway dispatchers - Serialize chunk_id in API responses - Document RAG - Added librarian client to fetch chunk content from Garage using chunk_ids - CLI tools - Updated all three tools: - invoke_document_embeddings.py - displays chunk_ids, removed max_chunk_length - save_doc_embeds.py - exports chunk_id - load_doc_embeds.py - imports chunk_id	2026-03-07 23:10:45 +00:00
cybermaggedon	be358efe67	Fix tests (#663 )	2026-03-06 12:40:02 +00:00
cybermaggedon	2b9232917c	Fix/extraction prov (#662 ) Quoted triple fixes, including... 1. Updated triple_provenance_triples() in triples.py: - Now accepts a Triple object directly - Creates the reification triple using TRIPLE term type: stmt_uri tg:reifies <<extracted_triple>> - Includes it in the returned provenance triples 2. Updated definitions extractor: - Added imports for provenance functions and component version - Added ParameterSpec for optional llm-model and ontology flow parameters - For each definition triple, generates provenance with reification 3. Updated relationships extractor: - Same changes as definitions extractor	2026-03-06 12:23:58 +00:00
cybermaggedon	cd5580be59	Extract-time provenance (#661 ) 1. Shared Provenance Module - URI generators, namespace constants, triple builders, vocabulary bootstrap 2. Librarian - Emits document metadata to graph on processing initiation (vocabulary bootstrap + PROV-O triples) 3. PDF Extractor - Saves pages as child documents, emits parent-child provenance edges, forwards page IDs 4. Chunker - Saves chunks as child documents, emits provenance edges, forwards chunk ID + content 5. Knowledge Extractors (both definitions and relationships): - Link entities to chunks via SUBJECT_OF (not top-level document) - Removed duplicate metadata emission (now handled by librarian) - Get chunk_doc_id and chunk_uri from incoming Chunk message 6. Embedding Provenance: - EntityContext schema has chunk_id field - EntityEmbeddings schema has chunk_id field - Definitions extractor sets chunk_id when creating EntityContext - Graph embeddings processor passes chunk_id through to EntityEmbeddings Provenance Flow: Document → Page (PDF) → Chunk → Extracted Facts/Embeddings ↓ ↓ ↓ ↓ librarian librarian librarian (chunk_id reference) + graph + graph + graph Each artifact is stored in librarian with parent-child linking, and PROV-O edges are emitted to the knowledge graph for full traceability from any extracted fact back to its source document. Also, updating tests	2026-03-05 18:36:10 +00:00
cybermaggedon	d8f0a576af	Document API updates (#660 ) * Doc streaming from librarian * Fix chunk minimum confusion * Add CLI args	2026-03-05 15:20:45 +00:00
cybermaggedon	a630e143ef	Incremental / large document loading (#659 ) Tech spec BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py): - get_stream() - yields document content in chunks for streaming retrieval - create_multipart_upload() - initializes S3 multipart upload, returns upload_id - upload_part() - uploads a single part, returns etag - complete_multipart_upload() - finalizes upload with part etags - abort_multipart_upload() - cancels and cleans up Cassandra schema (trustgraph-flow/trustgraph/tables/library.py): - New upload_session table with 24-hour TTL - Index on user for listing sessions - Prepared statements for all operations - Methods: create_upload_session(), get_upload_session(), update_upload_session_chunk(), delete_upload_session(), list_upload_sessions() - Schema extended with UploadSession, UploadProgress, and new request/response fields - Librarian methods: begin_upload, upload_chunk, complete_upload, abort_upload, get_upload_status, list_uploads - Service routing for all new operations - Python SDK with transparent chunked upload: - add_document() auto-switches to chunked for files > 10MB - Progress callback support (on_progress) - get_pending_uploads(), get_upload_status(), abort_upload(), resume_upload() - Document table: Added parent_id and document_type columns with index - Document schema (knowledge/document.py): Added document_id field for streaming retrieval - Librarian operations: - add-child-document for extracted PDF pages - list-children to get child documents - stream-document for chunked content retrieval - Cascade delete removes children when parent is deleted - list-documents filters children by default - PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large documents from librarian API to temp file - Librarian service (librarian/service.py): Sends document_id instead of content for large PDFs (>2MB) - Deprecated tools (load_pdf.py, load_text.py): Added deprecation warnings directing users to tg-add-library-document + tg-start-library-processing Remove load_pdf and load_text utils Move chunker/librarian comms to base class Updating tests	2026-03-04 16:57:58 +00:00
cybermaggedon	a38ca9474f	Tool services - dynamically pluggable tool implementations for agent frameworks (#658 ) * New schema * Tool service implementation * Base class * Joke service, for testing * Update unit tests for tool services	2026-03-04 14:51:32 +00:00
cybermaggedon	0b83c08ae4	Use model in Azure LLM integration (#657 )	2026-03-04 12:06:06 +00:00
cybermaggedon	e19ea8667d	Tool services tech spec (#656 )	2026-02-28 14:46:13 +00:00
cybermaggedon	4d31cd4c03	Agent explainability tech specs (#655 ) * Query time provenance tech spec * Extraction provenance placeholder	2026-02-28 14:44:18 +00:00
cybermaggedon	88fe8468bc	Update CI for 2.1 release (#653 )	2026-02-28 11:10:11 +00:00
cybermaggedon	b915602635	Merge master into release/v2.1 (#652 ) * Just to align the README, making future merge easier --------- Co-authored-by: Jack Colquitt <126733989+JackColquitt@users.noreply.github.com>	2026-02-28 11:07:03 +00:00
cybermaggedon	6d8da748d7	Fix mismatching ge-query / graph-embeddings-query service idents (#648 )	2026-02-24 12:17:29 +00:00
cybermaggedon	7d2d59a80f	Fix/tests (#647 )	2026-02-23 22:01:47 +00:00
cybermaggedon	4bbc6d844f	Row embeddings APIs exposed (#646 ) * Added row embeddings API and CLI support * Updated protocol specs * Row embeddings agent tool * Add new agent tool to CLI	2026-02-23 21:52:56 +00:00
cybermaggedon	1809c1f56d	Structured data 2 (#645 ) * Structured data refactor - multi-index tables, remove need for manual mods to the Cassandra tables * Tech spec updated to track implementation	2026-02-23 15:56:29 +00:00
cybermaggedon	5ffad92345	Fix subscriber unexpected message causing queue clogging (#642 )	2026-02-23 14:34:05 +00:00
cybermaggedon	0116eb3dea	Fix Goog AI Studio (#641 )	2026-02-20 10:27:47 +00:00
cybermaggedon	08063a5ee9	Remove unused deps (#640 ) * Removed the Google GenAI hard-coded install	2026-02-20 10:13:44 +00:00
cybermaggedon	2d8dbf4cdb	Move GAIStudio to vertexai package to simplify deps (#639 )	2026-02-20 08:46:29 +00:00
cybermaggedon	769c56bbea	Use ClientError & code to determine 429 error (#638 )	2026-02-20 08:00:07 +00:00
cybermaggedon	b2e768c309	Fixing Uri import error (#636 )	2026-02-16 19:18:40 +00:00
cybermaggedon	89b69fdb08	Fix weird Ontology URI issue (#637 )	2026-02-16 19:18:29 +00:00
cybermaggedon	d886358be6	Entity & triple batch size limits (#635 ) * Entities and triples are emitted in batches with a batch limit to manage overloading downstream. * Update tests	2026-02-16 17:38:03 +00:00
cybermaggedon	fe389354f6	Fix d/g attribute error (#634 )	2026-02-16 13:34:08 +00:00
cybermaggedon	00c1ca681b	Entity-centric graph (#633 ) * Tech spec for new entity-centric graph schema * Graph implementation	2026-02-16 13:26:43 +00:00
cybermaggedon	f24f1ebd80	Migrate VertexAI to google-genai SDK from deprecated library (#632 ) * Migrate VertexAI to google-genai SDK from deprecated library * Fix tests, mock the correct API	2026-02-09 20:43:33 +00:00
cybermaggedon	2781c7d87c	Fix LLM metrics (#631 ) * Fix mistral metrics * Fix to other models	2026-02-09 19:35:42 +00:00
cybermaggedon	4fca97d555	Output the entity term as well as its definition as entity contexts (#629 )	2026-02-09 15:18:05 +00:00
cybermaggedon	8574861196	Protect null embeddings - v2.0 (#627 ) * Don't emit graph embeddings if there aren't any. * Don't store graph embeddings in a knowledge store if there's an empty list. * Translate between Cassandra's 'null' representing an empty list and an empty list which is what the surrounding code wants (and stored in the first place). * Avoid emitting empty embedding lists * Avoid output empty triple lists * Fix tests	2026-02-09 14:57:36 +00:00
cybermaggedon	98827e5561	Fix version needing updating in pipelines (#625 )	2026-02-04 14:12:01 +00:00
cybermaggedon	6bf08c3ace	Feature/more cli diags (#624 ) * CLI tools for tg-invoke-graph-embeddings, tg-invoke-document-embeddings, and tg-invoke-embeddings. Just useful for diagnostics. * Fix tg-load-knowledge	2026-02-04 14:10:30 +00:00
cybermaggedon	23cc4dfdd1	Fix: version needed updating in pipelines (#623 )	2026-01-27 15:42:01 +00:00
cybermaggedon	cf0daedefa	Changed schema for Value -> Term, majorly breaking change (#622 ) * Changed schema for Value -> Term, majorly breaking change * Following the schema change, Value -> Term into all processing * Updated Cassandra for g, p, s, o index patterns (7 indexes) * Reviewed and updated all tests * Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down	2026-01-27 13:48:08 +00:00
cybermaggedon	e061f2c633	Graph contexts tech spec (#621 )	2026-01-26 22:41:00 +00:00
cybermaggedon	e214eb4e02	Feature/prompts jsonl (#619 ) * Tech spec * JSONL implementation complete * Updated prompt client users * Fix tests	2026-01-26 17:38:00 +00:00
cybermaggedon	e4f0013841	Open 1.9 branch (#620 )	2026-01-26 17:36:25 +00:00
cybermaggedon	11f41b07ab	Get neo4j to use limit (#618 ) * Get neo4j to use limit * Fix tests - they were exact matching on query strings	2026-01-22 15:16:34 +00:00

1121 commits
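
The auto-detection rules listed in commit c951562189 (Graph query CLI tool) can be illustrated with a short sketch. This is not the tool's actual code: the function name `detect_term_type`, the returned labels, and the order in which the rules are checked are all assumptions made for illustration. Only the rules themselves (the `http://`, `https://`, `urn:` prefixes, `<wrapped>` IRIs, `<<s p o>>` quoted triples, and the literal fallback) come from the commit message.

```python
# Hypothetical sketch of the term-type auto-detection rules described in
# commit c951562189. Names and check order are assumptions, not the
# actual implementation in the TrustGraph CLI.

IRI_PREFIXES = ("http://", "https://", "urn:")


def detect_term_type(value: str) -> str:
    """Classify a command-line term as 'triple', 'iri', or 'literal'."""
    # Test the quoted-triple form first: "<<s p o>>" also starts with "<"
    # and ends with ">", so it would otherwise match the wrapped-IRI rule.
    if value.startswith("<<") and value.endswith(">>"):
        return "triple"
    # Bare IRIs are recognised by their scheme prefix.
    if value.startswith(IRI_PREFIXES):
        return "iri"
    # Anything wrapped in a single pair of angle brackets is an IRI.
    if len(value) > 2 and value.startswith("<") and value.endswith(">"):
        return "iri"
    # Everything else falls through to a literal.
    return "literal"


if __name__ == "__main__":
    for term in ("https://example.org/a", "<ex:thing>", "<<s p o>>", "hello"):
        print(term, "->", detect_term_type(term))
```

Checking the quoted-triple form before the wrapped-IRI form is the one design choice that matters here, since the two patterns overlap.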