trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-05-04 12:52:36 +02:00

Author	SHA1	Message	Date
cybermaggedon	286f762369	The id field in pipeline Metadata was being overwritten at each processing (#686) The id field in pipeline Metadata was being overwritten at each processing stage (document → page → chunk), causing knowledge storage to create separate cores per chunk instead of grouping by document. Add a root field that: - Is set by librarian to the original document ID - Is copied unchanged through PDF decoder, chunkers, and extractors - Is used by knowledge storage for document_id grouping (with fallback to id) Changes: - Add root field to Metadata schema with empty string default - Set root=document.id in librarian when initiating document processing - Copy root through PDF decoder, recursive chunker, and all extractors - Update knowledge storage to use root (or id as fallback) for grouping - Add root handling to translators and gateway serialization - Update test mock Metadata class to include root parameter	2026-03-11 12:16:39 +00:00
cybermaggedon	aa4f5c6c00	Remove redundant metadata (#685) The metadata field (list of triples) in the pipeline Metadata class was redundant. Document metadata triples already flow directly from librarian to triple-store via emit_document_provenance() - they don't need to pass through the extraction pipeline. Additionally, chunker and PDF decoder were overwriting metadata to [] anyway, so any metadata passed through the pipeline was being discarded. Changes: - Remove metadata field from Metadata dataclass (schema/core/metadata.py) - Update all Metadata instantiations to remove metadata=[] parameter - Remove metadata handling from translators (document_loading, knowledge) - Remove metadata consumption from extractors (ontology, agent) - Update gateway serializers and import handlers - Update all unit, integration, and contract tests	2026-03-11 10:51:39 +00:00
cybermaggedon	3c3e11bef5	Fix/librarian broken (#674) * Set end-of-stream cleanly - clean streaming message structures * Add tg-get-document-content	2026-03-09 13:36:24 +00:00
cybermaggedon	df1808768d	Fix/doc streaming proto (#673) * Librarian streaming doc download * Document stream download endpoint	2026-03-09 12:36:10 +00:00
cybermaggedon	cd5580be59	Extract-time provenance (#661) 1. Shared Provenance Module - URI generators, namespace constants, triple builders, vocabulary bootstrap 2. Librarian - Emits document metadata to graph on processing initiation (vocabulary bootstrap + PROV-O triples) 3. PDF Extractor - Saves pages as child documents, emits parent-child provenance edges, forwards page IDs 4. Chunker - Saves chunks as child documents, emits provenance edges, forwards chunk ID + content 5. Knowledge Extractors (both definitions and relationships): - Link entities to chunks via SUBJECT_OF (not top-level document) - Removed duplicate metadata emission (now handled by librarian) - Get chunk_doc_id and chunk_uri from incoming Chunk message 6. Embedding Provenance: - EntityContext schema has chunk_id field - EntityEmbeddings schema has chunk_id field - Definitions extractor sets chunk_id when creating EntityContext - Graph embeddings processor passes chunk_id through to EntityEmbeddings Provenance Flow: Document (librarian + graph) → Page (PDF) (librarian + graph) → Chunk (librarian + graph) → Extracted Facts/Embeddings (chunk_id reference). Each artifact is stored in librarian with parent-child linking, and PROV-O edges are emitted to the knowledge graph for full traceability from any extracted fact back to its source document. Also, updating tests	2026-03-05 18:36:10 +00:00
cybermaggedon	d8f0a576af	Document API updates (#660) * Doc streaming from librarian * Fix chunk minimum confusion * Add CLI args	2026-03-05 15:20:45 +00:00
cybermaggedon	a630e143ef	Incremental / large document loading (#659) Tech spec BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py): - get_stream() - yields document content in chunks for streaming retrieval - create_multipart_upload() - initializes S3 multipart upload, returns upload_id - upload_part() - uploads a single part, returns etag - complete_multipart_upload() - finalizes upload with part etags - abort_multipart_upload() - cancels and cleans up Cassandra schema (trustgraph-flow/trustgraph/tables/library.py): - New upload_session table with 24-hour TTL - Index on user for listing sessions - Prepared statements for all operations - Methods: create_upload_session(), get_upload_session(), update_upload_session_chunk(), delete_upload_session(), list_upload_sessions() - Schema extended with UploadSession, UploadProgress, and new request/response fields - Librarian methods: begin_upload, upload_chunk, complete_upload, abort_upload, get_upload_status, list_uploads - Service routing for all new operations - Python SDK with transparent chunked upload: - add_document() auto-switches to chunked for files > 10MB - Progress callback support (on_progress) - get_pending_uploads(), get_upload_status(), abort_upload(), resume_upload() - Document table: Added parent_id and document_type columns with index - Document schema (knowledge/document.py): Added document_id field for streaming retrieval - Librarian operations: - add-child-document for extracted PDF pages - list-children to get child documents - stream-document for chunked content retrieval - Cascade delete removes children when parent is deleted - list-documents filters children by default - PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large documents from librarian API to temp file - Librarian service (librarian/service.py): Sends document_id instead of content for large PDFs (>2MB) - Deprecated tools (load_pdf.py, load_text.py): Added deprecation warnings directing users to tg-add-library-document + tg-start-library-processing Remove load_pdf and load_text utils Move chunker/librarian comms to base class Updating tests	2026-03-04 16:57:58 +00:00
cybermaggedon	387afee7b7	Fix load-doc (#610)	2026-01-14 15:46:29 +00:00
cybermaggedon	25563bae3c	Change MinIO integration options in librarian to be more generic - to support a Garage integration (#594) * Tweak object store parameters to be more generic for other S3-type store integration * Update librarian to have region & SSL params * Update MinIO migration tech spec	2025-12-27 18:01:51 +00:00
cybermaggedon	34eb083836	Messaging fabric plugins (#592) * Plugin architecture for messaging fabric * Schemas use a technology neutral expression * Schemas strictness has uncovered some incorrect schema use which is fixed	2025-12-17 21:40:43 +00:00
cybermaggedon	7d07f802a8	Basic multitenant support (#583) * Tech spec * Address multi-tenant queue option problems in CLI * Modified collection service to use config * Changed storage management to use the config service definition	2025-12-05 21:45:30 +00:00
cybermaggedon	fcd15d1833	Collection management part 2 (#522) * Plumb collection manager into librarian * Test end-to-end	2025-09-19 16:08:47 +01:00
cybermaggedon	c078ca45cd	Refactor more Cassandra stuff to use the helper (#490)	2025-09-04 12:54:58 +01:00
cybermaggedon	85e669c763	Fixing more Cassandra consistency issues (#488) * Fixing more Cassandra work * Fix tests	2025-09-04 00:58:11 +01:00
cybermaggedon	dd70aade11	Implement logging strategy (#444) * Logging strategy and convert all prints() to logging invocations	2025-07-30 23:18:38 +01:00
cybermaggedon	807c19fd22	knowledge service (#367) * Write knowledge core elements to Cassandra * Store service works, building management service * kg-manager	2025-05-06 23:44:10 +01:00
cybermaggedon	8146f0f2ff	Librarian doc submission (#362)	2025-05-04 22:56:47 +01:00
cybermaggedon	ff28d26f4d	Feature/flow librarian (#361) * Update librarian to new API * Implementing new schema with document + processing objects	2025-05-04 22:26:19 +01:00
cybermaggedon	a9197d11ee	Feature/configure flows (#345) - Keeps processing in different flows separate so that data can go to different stores / collections etc. - Potentially supports different processing flows - Tidies the processing API with common base-classes for e.g. LLMs, and automatic configuration of 'clients' to use the right queue names in a flow	2025-04-22 20:21:38 +01:00
cybermaggedon	f1559c5944	Feature/librarian (#310) * Add fields to library schema * Added list function, incomplete * Librarian list operation	2025-03-11 16:52:59 +00:00
cybermaggedon	f7df2df266	Feature/librarian (#307) * Bring QDrant up-to-date * Tables for data from queue outputs - Pass single Pulsar client to everything in gateway & librarian - Pulsar listener-name support in gateway - PDF and text load working in librarian * Complete Cassandra schema * Add librarian support to templates	2025-02-12 23:39:24 +00:00
cybermaggedon	f350abb415	Maint/asyncio (#305) * Move to asyncio services, even though everything is largely sync	2025-02-11 23:24:46 +00:00
cybermaggedon	a0bf2362f6	Librarian (#304)	2025-02-11 16:01:03 +00:00

23 commits
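
Illustration for 286f762369 (#686): a minimal sketch of the root-field behaviour described in that commit message, not the project's actual code. The Metadata class here is a simplified stand-in with only the id and root fields; next_stage() and grouping_key() are hypothetical helper names, and the ID strings are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metadata:
    # Simplified stand-in for the pipeline Metadata schema; only the two
    # fields discussed in the commit message are modelled (assumption).
    id: str
    root: str = ""  # empty string default, as the commit message states


def next_stage(meta: Metadata, new_id: str) -> Metadata:
    # Hypothetical helper: a stage (PDF decoder, chunker, extractor) gives
    # its output a new id but copies root through unchanged.
    return replace(meta, id=new_id)


def grouping_key(meta: Metadata) -> str:
    # Knowledge storage groups by root, falling back to id if root is empty.
    return meta.root or meta.id


# Librarian sets root to the original document ID when processing starts.
doc = Metadata(id="doc-1", root="doc-1")
page = next_stage(doc, "doc-1/p1")
chunks = [next_stage(page, f"doc-1/p1/c{i}") for i in range(3)]

# Group chunk IDs the way storage would: by root rather than by chunk id.
groups = defaultdict(list)
for chunk in chunks:
    groups[grouping_key(chunk)].append(chunk.id)
```

With root carried through, all three chunks fall under the single key doc-1. Grouping by id alone would produce one group per chunk, which is the behaviour the commit set out to fix.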