Tech spec
BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up
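
The multipart lifecycle listed above can be sketched with an in-memory stand-in. This is not the real BlobStore: the class name InMemoryBlobStore, the argument lists, and the use of a SHA-256 digest as the etag are assumptions made only so the example runs. Only the method names come from the list above.

```python
import hashlib
import uuid


class InMemoryBlobStore:
    """Hypothetical stand-in for BlobStore; keeps everything in dicts."""

    def __init__(self):
        self.objects = {}  # object_id -> bytes
        self.uploads = {}  # upload_id -> (object_id, {part_number: bytes})

    def create_multipart_upload(self, object_id):
        upload_id = str(uuid.uuid4())
        self.uploads[upload_id] = (object_id, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        _, parts = self.uploads[upload_id]
        parts[part_number] = data
        # Assumption: a SHA-256 hex digest stands in for the store's etag.
        return hashlib.sha256(data).hexdigest()

    def complete_multipart_upload(self, upload_id, part_etags):
        object_id, parts = self.uploads[upload_id]
        # Check each (part_number, etag) pair before assembling the object.
        for number, etag in part_etags:
            if hashlib.sha256(parts[number]).hexdigest() != etag:
                raise ValueError(f"etag mismatch for part {number}")
        ordered = sorted(part_etags)
        self.objects[object_id] = b"".join(parts[n] for n, _ in ordered)
        del self.uploads[upload_id]

    def abort_multipart_upload(self, upload_id):
        # Drop the session and any parts received so far.
        self.uploads.pop(upload_id, None)

    def get_stream(self, object_id, chunk_size=4):
        data = self.objects[object_id]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]


store = InMemoryBlobStore()
upload_id = store.create_multipart_upload("doc-1")
etags = [
    (number, store.upload_part(upload_id, number, part))
    for number, part in enumerate([b"hello ", b"world"], start=1)
]
store.complete_multipart_upload(upload_id, etags)
```

Reading the object back through get_stream() yields it in chunk_size pieces, so a caller never needs the whole document in memory at once.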
Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
update_upload_session_chunk(), delete_upload_session(),
list_upload_sessions()
- Schema extended with UploadSession, UploadProgress, and new
request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
  - add_document() auto-switches to chunked for files > 10MB
  - Progress callback support (on_progress)
  - get_pending_uploads(), get_upload_status(), abort_upload(),
    resume_upload()
- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
streaming retrieval
- Librarian operations:
  - add-child-document for extracted PDF pages
  - list-children to get child documents
  - stream-document for chunked content retrieval
- Cascade delete removes children when parent is deleted
- list-documents excludes child documents by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
warnings directing users to tg-add-library-document +
tg-start-library-processing
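
The transparent chunked upload noted in the list above might look roughly like this on the client side. It is a sketch, not the SDK's code: FakeLibrarian, add_whole(), the function signature, and the 5MB chunk size are hypothetical. The 10MB threshold, the on_progress callback, and the begin_upload / upload_chunk / complete_upload names are taken from the notes.

```python
DEFAULT_THRESHOLD = 10 * 1024 * 1024   # 10MB switch-over, from the notes
DEFAULT_CHUNK_SIZE = 5 * 1024 * 1024   # assumed; not stated in the notes


class FakeLibrarian:
    """Hypothetical in-memory librarian, only here to make the sketch run."""

    def __init__(self):
        self.sessions = {}
        self.documents = {}
        self.calls = []

    def add_whole(self, doc_id, content):
        self.calls.append("add_whole")
        self.documents[doc_id] = content

    def begin_upload(self, doc_id):
        self.calls.append("begin_upload")
        self.sessions[doc_id] = []
        return doc_id  # the session id is simply the document id here

    def upload_chunk(self, session_id, index, chunk):
        self.calls.append("upload_chunk")
        self.sessions[session_id].append((index, chunk))

    def complete_upload(self, session_id):
        self.calls.append("complete_upload")
        chunks = sorted(self.sessions.pop(session_id))
        self.documents[session_id] = b"".join(c for _, c in chunks)


def add_document(librarian, doc_id, content, on_progress=None,
                 threshold=DEFAULT_THRESHOLD, chunk_size=DEFAULT_CHUNK_SIZE):
    """Send small documents in one request, larger ones chunk by chunk."""
    if len(content) <= threshold:
        librarian.add_whole(doc_id, content)
        return "single"

    session_id = librarian.begin_upload(doc_id)
    sent = 0
    for index, start in enumerate(range(0, len(content), chunk_size)):
        chunk = content[start:start + chunk_size]
        librarian.upload_chunk(session_id, index, chunk)
        sent += len(chunk)
        if on_progress is not None:
            on_progress(sent, len(content))  # bytes sent so far, total bytes
    librarian.complete_upload(session_id)
    return "chunked"
```

Keeping the threshold and chunk size as parameters makes the switch-over easy to exercise with small inputs, for example threshold=10 and chunk_size=5 on a 12-byte payload.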
Remove load_pdf and load_text utils
Move chunker/librarian comms to base class
Updating tests
* Don't emit graph embeddings if there aren't any.
* Don't store graph embeddings in a knowledge store if there's an empty list.
* Translate between Cassandra's 'null', which it returns in place of an
empty list, and the empty list that the surrounding code expects (and
stored in the first place).
* Avoid emitting empty embedding lists
* Avoid outputting empty triple lists
* Fix tests
* CLI tools for tg-invoke-graph-embeddings, tg-invoke-document-embeddings,
and tg-invoke-embeddings. Just useful for diagnostics.
* Fix tg-load-knowledge
* Changed schema from Value to Term; this is a major breaking change
* Propagated the Value -> Term schema change into all processing
* Updated Cassandra for g, p, s, o index patterns (7 indexes)
* Reviewed and updated all tests
* Neo4j, Memgraph and FalkorDB remain broken; will be revisited once things have settled down
* Removed legacy storage management cruft. Tidied tech specs.
* Fix deletion of last collection
* Storage processor ignores data on the queue which is for a deleted collection
* Updated tests
* Made object store parameters more generic to allow integration with other S3-type stores
* Update librarian to have region & SSL params
* Update MinIO migration tech spec
* Plugin architecture for messaging fabric
* Schemas use a technology neutral expression
* Schema strictness uncovered some incorrect schema use, which is now fixed
* Tech spec
* Address multi-tenant queue option problems in CLI
* Modified collection service to use config
* Changed storage management to use the config service definition
* Updates for agent API with streaming support
* Added tg-dump-queues tool to dump Pulsar queues to a log
* Updated tg-invoke-agent, incremental output
* Queue dumper CLI - might be useful for debug
* Updating for tests
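
The empty-embedding handling mentioned in the list above (Cassandra's null standing in for an empty list, and not emitting empty lists) can be sketched as follows. Both function names are hypothetical and the send callable is a placeholder for whatever actually publishes the message; this illustrates the idea, not the project's code.

```python
def normalise_embeddings(stored):
    """Map a missing value (None) read from storage back to an empty list."""
    return [] if stored is None else list(stored)


def emit_if_any(embeddings, send):
    """Call send() only for non-empty lists; report whether anything was sent."""
    embeddings = normalise_embeddings(embeddings)
    if not embeddings:
        return False
    send(embeddings)
    return True
```

Normalising at the storage boundary means the rest of the code only ever has to handle lists, and the emptiness check stays in one place.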