trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-13 00:35:14 +02:00

Author	SHA1	Message	Date
cybermaggedon	a630e143ef	Incremental / large document loading (#659) Tech spec BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py): - get_stream() - yields document content in chunks for streaming retrieval - create_multipart_upload() - initializes S3 multipart upload, returns upload_id - upload_part() - uploads a single part, returns etag - complete_multipart_upload() - finalizes upload with part etags - abort_multipart_upload() - cancels and cleans up Cassandra schema (trustgraph-flow/trustgraph/tables/library.py): - New upload_session table with 24-hour TTL - Index on user for listing sessions - Prepared statements for all operations - Methods: create_upload_session(), get_upload_session(), update_upload_session_chunk(), delete_upload_session(), list_upload_sessions() - Schema extended with UploadSession, UploadProgress, and new request/response fields - Librarian methods: begin_upload, upload_chunk, complete_upload, abort_upload, get_upload_status, list_uploads - Service routing for all new operations - Python SDK with transparent chunked upload: - add_document() auto-switches to chunked for files > 10MB - Progress callback support (on_progress) - get_pending_uploads(), get_upload_status(), abort_upload(), resume_upload() - Document table: Added parent_id and document_type columns with index - Document schema (knowledge/document.py): Added document_id field for streaming retrieval - Librarian operations: - add-child-document for extracted PDF pages - list-children to get child documents - stream-document for chunked content retrieval - Cascade delete removes children when parent is deleted - list-documents filters children by default - PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large documents from librarian API to temp file - Librarian service (librarian/service.py): Sends document_id instead of content for large PDFs (>2MB) - Deprecated tools (load_pdf.py, load_text.py): Added deprecation warnings directing users to tg-add-library-document + tg-start-library-processing Remove load_pdf and load_text utils Move chunker/librarian comms to base class Updating tests	2026-03-04 16:57:58 +00:00
cybermaggedon	77fdec2c2d	Fix chunk params not converted (#549)	2025-10-07 00:04:34 +01:00
cybermaggedon	8929a680a1	Chunking dynamic params (#536) * Chunking params are dynamic * Update tests	2025-09-26 10:53:32 +01:00
cybermaggedon	a7de175b33	Fix token chunker, broken API invocation (#454)	2025-08-08 14:41:24 +01:00
cybermaggedon	dd70aade11	Implement logging strategy (#444) * Logging strategy and convert all prints() to logging invocations	2025-07-30 23:18:38 +01:00
cybermaggedon	a9197d11ee	Feature/configure flows (#345) - Keeps processing in different flows separate so that data can go to different stores / collections etc. - Potentially supports different processing flows - Tidies the processing API with common base-classes for e.g. LLMs, and automatic configuration of 'clients' to use the right queue names in a flow	2025-04-22 20:21:38 +01:00
cybermaggedon	f350abb415	Maint/asyncio (#305) * Move to asyncio services, even though everything is largely sync	2025-02-11 23:24:46 +00:00
cybermaggedon	7954e863cc	Feature: document metadata (#123) * Rework metadata structure in processing messages to be a subgraph * Add subgraph creation for tg-load-pdf and tg-load-text based on command-line passing of doc attributes * Document metadata is added to knowledge graph with subjectOf linkage to extracted entities	2024-10-23 18:04:04 +01:00
cybermaggedon	b0f4c58200	Feature / collections (#96) * Update schema defs for source -> metadata * Migrate to use metadata part of schema, also add metadata to triples & vecs * Add user/collection metadata to query * Use user/collection in RAG * Write and query working on triples	2024-10-02 18:14:29 +01:00
cybermaggedon	9b91d5eee3	Feature/pkgsplit (#83) * Starting to spawn base package * More package hacking * Bedrock and VertexAI * Parquet split * Updated templates * Utils	2024-09-30 19:36:09 +01:00

10 commits