Tech spec
BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up
Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
update_upload_session_chunk(), delete_upload_session(),
list_upload_sessions()
- Schema extended with UploadSession, UploadProgress, and new
request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
- add_document() auto-switches to chunked for files > 10MB
- Progress callback support (on_progress)
- get_pending_uploads(), get_upload_status(), abort_upload(),
resume_upload()
- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
streaming retrieval
- Librarian operations:
- add-child-document for extracted PDF pages
- list-children to get child documents
- stream-document for chunked content retrieval
- Cascade delete removes children when parent is deleted
- list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
warnings directing users to tg-add-library-document +
tg-start-library-processing
Remove load_pdf and load_text utils
Move chunker/librarian comms to base class
Updating tests
* Changed schema for Value -> Term, majorly breaking change
* Following the schema change, Value -> Term into all processing
* Updated Cassandra for g, p, s, o index patterns (7 indexes)
* Reviewed and updated all tests
* Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down
* Removed legacy storage management cruft. Tidied tech specs.
* Fix deletion of last collection
* Storage processor ignores data on the queue which is for a deleted collection
* Updated tests
* Tweak object store parameters to be more generic for other S3-type store integration
* Update librarian to have region & SSL params
* Update MinIO migration tech spec
* Plugin architecture for messaging fabric
* Schemas use a technology neutral expression
* Schemas strictness has uncovered some incorrect schema use which is fixed
* Tech spec
* Address multi-tenant queue option problems in CLI
* Modified collection service to use config
* Changed storage management to use the config service definition
* Tech spec
* Python CLI utilities updated to use the API including streaming features
* Added type safety to Python API
* Completed missing auth token support in CLI
* Tidy up duplicate tech specs in doc directory
* Streaming LLM text-completion service tech spec.
* text-completion and prompt interfaces
* streaming change applied to all LLMs, so far tested with VertexAI
* Skip Pinecone unit tests, upstream module issue is affecting things, tests are passing again
* Added agent streaming, not working and has broken tests
* Onto-rag tech spec
* New processor kg-extract-ontology, use 'ontology' objects from config to guide triple extraction
* Also entity contexts
* Integrate with ontology extractor from workbench
This is first phase, the extraction is tested and working, also GraphRAG with the extracted knowledge works
* Fixed hard-coded embeddings store size
* Vector store lazy-creates collections, different collections for
different dimension lengths.
* Added tech spec for vector store lifecycle
* Fixed some tests for the new spec
* Fix pyproject.toml missing requests dep
* parameters is now parameter-types
* Update flow parameters tech spec for recent changes (no impact on this repo)
* Tech spec for graceful shutdown
* Graceful shutdown of importers/exporters
* Update socket to include graceful shutdown orchestration
* Adding tests for conditions tracked in this PR