Tech spec
BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up
Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
update_upload_session_chunk(), delete_upload_session(),
list_upload_sessions()
- Schema extended with UploadSession, UploadProgress, and new
request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
- add_document() auto-switches to chunked for files > 10MB
- Progress callback support (on_progress)
- get_pending_uploads(), get_upload_status(), abort_upload(),
resume_upload()
- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
streaming retrieval
- Librarian operations:
- add-child-document for extracted PDF pages
- list-children to get child documents
- stream-document for chunked content retrieval
- Cascade delete removes children when parent is deleted
- list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
warnings directing users to tg-add-library-document +
tg-start-library-processing
Remove load_pdf and load_text utils
Move chunker/librarian comms to base class
Update tests
Large Document Loading Technical Specification
Overview
This specification addresses scalability and user experience issues when loading large documents into TrustGraph. The current architecture treats document upload as a single atomic operation, causing memory pressure at multiple points in the pipeline and providing no feedback or recovery options to users.
This implementation targets the following use cases:
- Large PDF Processing: Upload and process multi-hundred-megabyte PDF files without exhausting memory
- Resumable Uploads: Allow interrupted uploads to continue from where they left off rather than restarting
- Progress Feedback: Provide users with real-time visibility into upload and processing progress
- Memory-Efficient Processing: Process documents in a streaming fashion without holding entire files in memory
Goals
- Incremental Upload: Support chunked document upload via REST and WebSocket
- Resumable Transfers: Enable recovery from interrupted uploads
- Progress Visibility: Provide upload/processing progress feedback to clients
- Memory Efficiency: Eliminate full-document buffering throughout the pipeline
- Backward Compatibility: Existing small-document workflows continue unchanged
- Streaming Processing: PDF decoding and text chunking operate on streams
Background
Current Architecture
Document submission flows through the following path:
- Client submits document via REST (POST /api/v1/librarian) or WebSocket
- API Gateway receives complete request with base64-encoded document content
- LibrarianRequestor translates request to Pulsar message
- Librarian Service receives message, decodes document into memory
- BlobStore uploads document to Garage/S3
- Cassandra stores metadata with object reference
- For processing: document retrieved from S3, decoded, and chunked, all in memory
Key files:
- REST/WebSocket entry: trustgraph-flow/trustgraph/gateway/service.py
- Librarian core: trustgraph-flow/trustgraph/librarian/librarian.py
- Blob storage: trustgraph-flow/trustgraph/librarian/blob_store.py
- Cassandra tables: trustgraph-flow/trustgraph/tables/library.py
- API schema: trustgraph-base/trustgraph/schema/services/library.py
Current Limitations
The current design has several compounding memory and UX issues:
- Atomic Upload Operation: The entire document must be transmitted in a single request. Large documents require long-running requests with no progress indication and no retry mechanism if the connection fails.
- API Design: Both REST and WebSocket APIs expect the complete document in a single message. The schema (LibrarianRequest) has a single content field containing the entire base64-encoded document.
- Librarian Memory: The librarian service decodes the entire document into memory before uploading to S3. For a 500MB PDF, this means holding 500MB+ in process memory.
- PDF Decoder Memory: When processing begins, the PDF decoder loads the entire PDF into memory to extract text. PyPDF and similar libraries typically require full document access.
- Chunker Memory: The text chunker receives the complete extracted text and holds it in memory while producing chunks.
Memory Impact Example (500MB PDF):
- Gateway: ~700MB (base64 encoding overhead)
- Librarian: ~500MB (decoded bytes)
- PDF Decoder: ~500MB + extraction buffers
- Chunker: extracted text (variable, potentially 100MB+)
Total peak memory can exceed 2GB for a single large document.
Technical Design
Design Principles
- API Facade: All client interaction goes through the librarian API. Clients have no direct access to or knowledge of the underlying S3/Garage storage.
- S3 Multipart Upload: Use standard S3 multipart upload under the hood. This is widely supported across S3-compatible systems (AWS S3, MinIO, Garage, Ceph, DigitalOcean Spaces, Backblaze B2, etc.), ensuring portability.
- Atomic Completion: S3 multipart uploads are inherently atomic: uploaded parts are invisible until CompleteMultipartUpload is called. No temporary files or rename operations are needed.
- Trackable State: Upload sessions are tracked in Cassandra, providing visibility into incomplete uploads and enabling resume capability.
Chunked Upload Flow
Client Librarian API S3/Garage
│ │ │
│── begin-upload ───────────►│ │
│ (metadata, size) │── CreateMultipartUpload ────►│
│ │◄── s3_upload_id ─────────────│
│◄── upload_id ──────────────│ (store session in │
│ │ Cassandra) │
│ │ │
│── upload-chunk ───────────►│ │
│ (upload_id, index, data) │── UploadPart ───────────────►│
│ │◄── etag ─────────────────────│
│◄── ack + progress ─────────│ (store etag in session) │
│ ⋮ │ ⋮ │
│ (repeat for all chunks) │ │
│ │ │
│── complete-upload ────────►│ │
│ (upload_id) │── CompleteMultipartUpload ──►│
│ │ (parts coalesced by S3) │
│ │── store doc metadata ───────►│ Cassandra
│◄── document_id ────────────│ (delete session) │
The client never interacts with S3 directly. The librarian translates between our chunked upload API and S3 multipart operations internally.
Librarian API Operations
begin-upload
Initialize a chunked upload session.
Request:
{
"operation": "begin-upload",
"document-metadata": {
"id": "doc-123",
"kind": "application/pdf",
"title": "Large Document",
"user": "user-id",
"tags": ["tag1", "tag2"]
},
"total-size": 524288000,
"chunk-size": 5242880
}
Response:
{
"upload-id": "upload-abc-123",
"chunk-size": 5242880,
"total-chunks": 100
}
The librarian:
- Generates a unique upload_id and object_id (a UUID for blob storage)
- Calls S3 CreateMultipartUpload, receives s3_upload_id
- Creates a session record in Cassandra
- Returns upload_id to the client
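The steps above can be sketched as follows. This is a hypothetical handler, not the actual TrustGraph code: the injected create_mpu and save_session callables stand in for the BlobStore and the Cassandra session table.

```python
import math
import uuid

def begin_upload(metadata, total_size, chunk_size, create_mpu, save_session):
    """Sketch of the begin-upload handler (dependencies injected for clarity)."""
    upload_id = f"upload-{uuid.uuid4()}"
    object_id = uuid.uuid4()                           # target blob ID
    total_chunks = math.ceil(total_size / chunk_size)
    # Ask the blob store to start an S3 multipart upload
    s3_upload_id = create_mpu(object_id, metadata["kind"])
    # Persist the session so chunks can be tracked and resumed
    save_session(upload_id, metadata, object_id, s3_upload_id,
                 total_size, chunk_size, total_chunks)
    return {"upload-id": upload_id,
            "chunk-size": chunk_size,
            "total-chunks": total_chunks}
```

Note that the response fields match the begin-upload response shown above.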
upload-chunk
Upload a single chunk.
Request:
{
"operation": "upload-chunk",
"upload-id": "upload-abc-123",
"chunk-index": 0,
"content": "<base64-encoded-chunk>"
}
Response:
{
"upload-id": "upload-abc-123",
"chunk-index": 0,
"chunks-received": 1,
"total-chunks": 100,
"bytes-received": 5242880,
"total-bytes": 524288000
}
The librarian:
- Looks up session by upload_id
- Validates ownership (user must match session creator)
- Calls S3 UploadPart with chunk data, receives etag
- Updates session record with chunk index and etag
- Returns progress to client
Failed chunks can be retried - just send the same chunk-index again.
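A minimal client-side retry sketch; the send callable is a stand-in (an assumption, not the SDK API) for whatever performs the upload-chunk request:

```python
import time

def upload_chunk_with_retry(send, upload_id, chunk_index, data, attempts=3):
    """Retry a failed chunk with the same chunk-index; this is safe because
    re-sending a part number simply overwrites the earlier S3 part."""
    for attempt in range(attempts):
        try:
            return send(upload_id, chunk_index, data)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
```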
complete-upload
Finalize the upload and create the document.
Request:
{
"operation": "complete-upload",
"upload-id": "upload-abc-123"
}
Response:
{
"document-id": "doc-123",
"object-id": "550e8400-e29b-41d4-a716-446655440000"
}
The librarian:
- Looks up session, verifies all chunks received
- Calls S3 CompleteMultipartUpload with the part etags (S3 coalesces parts internally, at zero memory cost to the librarian)
- Creates document record in Cassandra with metadata and object reference
- Deletes upload session record
- Returns document ID to client
abort-upload
Cancel an in-progress upload.
Request:
{
"operation": "abort-upload",
"upload-id": "upload-abc-123"
}
The librarian:
- Calls S3 AbortMultipartUpload to clean up parts
- Deletes session record from Cassandra
get-upload-status
Query status of an upload (for resume capability).
Request:
{
"operation": "get-upload-status",
"upload-id": "upload-abc-123"
}
Response:
{
"upload-id": "upload-abc-123",
"state": "in-progress",
"chunks-received": [0, 1, 2, 5, 6],
"missing-chunks": [3, 4, 7, 8],
"total-chunks": 100,
"bytes-received": 36700160,
"total-bytes": 524288000
}
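Deriving the missing-chunks list from the session's received set is straightforward (a small total is used here for illustration):

```python
def missing_chunks(received: list[int], total_chunks: int) -> list[int]:
    """Chunks the client still needs to upload before complete-upload."""
    have = set(received)
    return [i for i in range(total_chunks) if i not in have]

print(missing_chunks([0, 1, 2, 5, 6], 9))  # [3, 4, 7, 8]
```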
list-uploads
List incomplete uploads for a user.
Request:
{
"operation": "list-uploads"
}
Response:
{
"uploads": [
{
"upload-id": "upload-abc-123",
"document-metadata": { "title": "Large Document", ... },
"progress": { "chunks-received": 43, "total-chunks": 100 },
"created-at": "2024-01-15T10:30:00Z"
}
]
}
Upload Session Storage
Track in-progress uploads in Cassandra:
CREATE TABLE upload_session (
upload_id text PRIMARY KEY,
user text,
document_id text,
document_metadata text, -- JSON: title, kind, tags, comments, etc.
s3_upload_id text, -- internal, for S3 operations
object_id uuid, -- target blob ID
total_size bigint,
chunk_size int,
total_chunks int,
chunks_received map<int, text>, -- chunk_index → etag
created_at timestamp,
updated_at timestamp
) WITH default_time_to_live = 86400; -- 24 hour TTL
CREATE INDEX upload_session_user ON upload_session (user);
TTL Behavior:
- Sessions expire after 24 hours if not completed
- When Cassandra TTL expires, the session record is deleted
- Orphaned S3 parts are cleaned up by S3 lifecycle policy (configure on bucket)
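One way to implement the bucket-side cleanup is an S3 lifecycle rule that aborts incomplete multipart uploads after a day, matching the session TTL (the rule ID below is arbitrary):

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
```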
Failure Handling and Atomicity
Chunk upload failure:
- Client retries the failed chunk (same upload_id and chunk-index)
- S3 UploadPart is idempotent for the same part number
- Session tracks which chunks succeeded
Client disconnect mid-upload:
- Session remains in Cassandra with received chunks recorded
- Client can call get-upload-status to see what's missing
- Resume by uploading only missing chunks, then complete-upload
Complete-upload failure:
- S3 CompleteMultipartUpload is atomic: either it succeeds fully or it fails
- On failure, parts remain and the client can retry complete-upload
- No partial document is ever visible
Session expiry:
- Cassandra TTL deletes session record after 24 hours
- S3 bucket lifecycle policy cleans up incomplete multipart uploads
- No manual cleanup required
S3 Multipart Atomicity
S3 multipart uploads provide built-in atomicity:
- Parts are invisible: Uploaded parts cannot be accessed as objects. They exist only as parts of an incomplete multipart upload.
- Atomic completion: CompleteMultipartUpload either succeeds (the object appears atomically) or fails (no object is created). There is no partial state.
- No rename needed: The final object key is specified at CreateMultipartUpload time. Parts are coalesced directly to that key.
- Server-side coalesce: S3 combines parts internally. The librarian never reads parts back, so memory overhead is zero regardless of document size.
BlobStore Extensions
File: trustgraph-flow/trustgraph/librarian/blob_store.py
Add multipart upload methods:
from typing import List, Tuple
from uuid import UUID

class BlobStore:

    # Existing methods...

    def create_multipart_upload(self, object_id: UUID, kind: str) -> str:
        """Initialize multipart upload, return s3_upload_id."""
        # minio client: create_multipart_upload()
        ...

    def upload_part(
        self, object_id: UUID, s3_upload_id: str,
        part_number: int, data: bytes
    ) -> str:
        """Upload a single part, return etag."""
        # minio client: upload_part()
        # Note: S3 part numbers are 1-indexed
        ...

    def complete_multipart_upload(
        self, object_id: UUID, s3_upload_id: str,
        parts: List[Tuple[int, str]]  # [(part_number, etag), ...]
    ) -> None:
        """Finalize multipart upload."""
        # minio client: complete_multipart_upload()
        ...

    def abort_multipart_upload(
        self, object_id: UUID, s3_upload_id: str
    ) -> None:
        """Cancel multipart upload, clean up parts."""
        # minio client: abort_multipart_upload()
        ...
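Since the session stores a 0-indexed chunk_index → etag map while S3 part numbers are 1-indexed, completion needs a small translation step. A helper along these lines (the name is illustrative, not an existing function):

```python
def to_s3_parts(chunks_received: dict[int, str]) -> list[tuple[int, str]]:
    """Convert the session's chunk_index -> etag map into the ordered,
    1-indexed part list that complete_multipart_upload expects."""
    return [(index + 1, etag) for index, etag in sorted(chunks_received.items())]

print(to_s3_parts({1: "etag-b", 0: "etag-a"}))  # [(1, 'etag-a'), (2, 'etag-b')]
```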
Chunk Size Considerations
- S3 minimum: 5MB per part (except last part)
- S3 maximum: 10,000 parts per upload
- Practical default: 5MB chunks
- 500MB document = 100 chunks
- 5GB document = 1,000 chunks
- Progress granularity: Smaller chunks = finer progress updates
- Network efficiency: Larger chunks = fewer round trips
Chunk size could be client-configurable within bounds (5MB - 100MB).
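These constraints can be validated up front when a client proposes a chunk size. A sketch, using the S3 multipart limits stated above:

```python
import math

S3_MIN_PART_SIZE = 5 * 1024 * 1024   # 5MB minimum per part (except last)
S3_MAX_PARTS = 10_000                # maximum parts per upload

def plan_upload(total_size: int, chunk_size: int = S3_MIN_PART_SIZE) -> int:
    """Validate the chunk size against S3 limits and return the chunk count."""
    if chunk_size < S3_MIN_PART_SIZE:
        raise ValueError("chunk size below the 5MB S3 part minimum")
    total_chunks = math.ceil(total_size / chunk_size)
    if total_chunks > S3_MAX_PARTS:
        raise ValueError("too many parts; increase the chunk size")
    return total_chunks

print(plan_upload(524288000))  # 100 chunks for a 500MB document
```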
Document Processing: Streaming Retrieval
The upload flow addresses getting documents into storage efficiently. The processing flow addresses extracting and chunking documents without loading them entirely into memory.
Design Principle: Identifier, Not Content
Currently, when processing is triggered, document content flows through Pulsar messages. This loads entire documents into memory. Instead:
- Pulsar messages carry only the document identifier
- Processors fetch document content directly from librarian
- Fetching happens as a stream to temporary file
- Document-specific parsing (PDF, text, etc.) works with files, not memory buffers
This keeps the librarian document-structure-agnostic. PDF parsing, text extraction, and other format-specific logic stays in the respective decoders.
Processing Flow
Pulsar PDF Decoder Librarian S3
│ │ │ │
│── doc-id ───────────►│ │ │
│ (processing msg) │ │ │
│ │ │ │
│ │── stream-document ──────►│ │
│ │ (doc-id) │── GetObject ────►│
│ │ │ │
│ │◄── chunk ────────────────│◄── stream ───────│
│ │ (write to temp file) │ │
│ │◄── chunk ────────────────│◄── stream ───────│
│ │ (append to temp file) │ │
│ │ ⋮ │ ⋮ │
│ │◄── EOF ──────────────────│ │
│ │ │ │
│ │ ┌──────────────────────────┐ │
│ │ │ temp file on disk │ │
│ │ │ (memory stays bounded) │ │
│ │ └────────────┬─────────────┘ │
│ │ │ │
│ │ PDF library opens file │
│ │ extract page 1 text ──► chunker │
│ │ extract page 2 text ──► chunker │
│ │ ⋮ │
│ │ close file │
│ │ delete temp file │
Librarian Stream API
Add a streaming document retrieval operation:
stream-document
Request:
{
"operation": "stream-document",
"document-id": "doc-123"
}
Response: Streamed binary chunks (not a single response).
For REST API, this returns a streaming response with Transfer-Encoding: chunked.
For internal service-to-service calls (processor to librarian), this could be:
- Direct S3 streaming via presigned URL (if internal network allows)
- Chunked responses over the service protocol
- A dedicated streaming endpoint
The key requirement: data flows in chunks, never fully buffered in librarian.
PDF Decoder Changes
Current implementation (memory-intensive):
from io import BytesIO
from pypdf import PdfReader

def decode_pdf(document_content: bytes) -> str:
    reader = PdfReader(BytesIO(document_content))  # full doc in memory
    text = ""
    for page in reader.pages:
        text += page.extract_text()  # accumulating
    return text  # full text in memory
New implementation (temp file, incremental):
import tempfile
from typing import Iterator

from pypdf import PdfReader

def decode_pdf_streaming(doc_id: str, librarian_client) -> Iterator[str]:
    """Yield extracted text page by page."""
    with tempfile.NamedTemporaryFile(delete=True, suffix='.pdf') as tmp:
        # Stream document to temp file
        for chunk in librarian_client.stream_document(doc_id):
            tmp.write(chunk)
        tmp.flush()
        # Open PDF from file (not memory)
        reader = PdfReader(tmp.name)
        # Yield pages incrementally
        for page in reader.pages:
            yield page.extract_text()
        # tmp file auto-deleted on context exit
Memory profile:
- Temp file on disk: size of PDF (disk is cheap)
- In memory: one page's text at a time
- Peak memory: bounded, independent of document size
Text Document Decoder Changes
For plain text documents, the flow is even simpler - no temp file is needed:
import codecs
from typing import Iterator

def decode_text_streaming(doc_id: str, librarian_client) -> Iterator[str]:
    """Yield text in chunks as it streams from storage."""
    # Incremental decoder: a multi-byte UTF-8 character split across a
    # chunk boundary would break a plain chunk.decode('utf-8')
    decoder = codecs.getincrementaldecoder('utf-8')()
    buffer = ""
    for chunk in librarian_client.stream_document(doc_id):
        buffer += decoder.decode(chunk)
        # Yield complete paragraphs as they arrive
        while '\n\n' in buffer:
            paragraph, buffer = buffer.split('\n\n', 1)
            yield paragraph + '\n\n'
    # Flush the decoder and yield the remaining buffer
    buffer += decoder.decode(b'', final=True)
    if buffer:
        yield buffer
Text documents can stream directly without temp file since they're linearly structured.
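One subtlety when decoding a byte stream: a chunk boundary can split a multi-byte UTF-8 character, so naive per-chunk decode calls can fail. The standard library's incremental decoder handles this; a standalone illustration:

```python
import codecs

# 'é' is two bytes in UTF-8; split the stream inside it
data = "café\n\nmenu".encode("utf-8")
chunks = [data[:4], data[4:]]

# chunks[0].decode('utf-8') would raise UnicodeDecodeError here;
# the incremental decoder buffers the partial character instead
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(c) for c in chunks)
text += decoder.decode(b"", final=True)
print(text == "café\n\nmenu")  # True
```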
Streaming Chunker Integration
The chunker receives an iterator of text (pages or paragraphs) and produces chunks incrementally:
from typing import Iterator

class StreamingChunker:
    def __init__(self, chunk_size: int, overlap: int):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def process(self, text_stream: Iterator[str]) -> Iterator[str]:
        """Yield chunks as text arrives."""
        buffer = ""
        for text_segment in text_stream:
            buffer += text_segment
            while len(buffer) >= self.chunk_size:
                chunk = buffer[:self.chunk_size]
                yield chunk
                # Keep overlap for context continuity
                buffer = buffer[self.chunk_size - self.overlap:]
        # Yield remaining buffer as final chunk
        if buffer.strip():
            yield buffer
End-to-End Processing Pipeline
async def process_document(doc_id: str, librarian_client, embedder):
    """Process document with bounded memory."""
    # Get document metadata to determine type
    metadata = await librarian_client.get_document_metadata(doc_id)

    # Select decoder based on document type
    if metadata.kind == 'application/pdf':
        text_stream = decode_pdf_streaming(doc_id, librarian_client)
    elif metadata.kind == 'text/plain':
        text_stream = decode_text_streaming(doc_id, librarian_client)
    else:
        raise UnsupportedDocumentType(metadata.kind)

    # Chunk incrementally
    chunker = StreamingChunker(chunk_size=1000, overlap=100)

    # Process each chunk as it's produced
    for chunk in chunker.process(text_stream):
        # Generate embeddings, store in vector DB, etc.
        embedding = await embedder.embed(chunk)
        await store_chunk(doc_id, chunk, embedding)
At no point is the full document or full extracted text held in memory.
Temp File Considerations
Location: Use system temp directory (/tmp or equivalent). For
containerized deployments, ensure temp directory has sufficient space
and is on fast storage (not network-mounted if possible).
Cleanup: Use context managers (with tempfile...) to ensure cleanup
even on exceptions.
Concurrent processing: Each processing job gets its own temp file. No conflicts between parallel document processing.
Disk space: Temp files are short-lived (duration of processing). For a 500MB PDF, need 500MB temp space during processing. Size limit could be enforced at upload time if disk space is constrained.
Unified Processing Interface: Child Documents
PDF extraction and text document processing need to feed into the same downstream pipeline (chunker → embeddings → storage). To achieve this with a consistent "fetch by ID" interface, extracted text blobs are stored back to librarian as child documents.
Processing Flow with Child Documents
PDF Document Text Document
│ │
▼ │
pdf-extractor │
│ │
│ (stream PDF from librarian) │
│ (extract page 1 text) │
│ (store as child doc → librarian) │
│ (extract page 2 text) │
│ (store as child doc → librarian) │
│ ⋮ │
▼ ▼
[child-doc-id, child-doc-id, ...] [doc-id]
│ │
└─────────────────────┬───────────────────────────────┘
▼
chunker
│
│ (receives document ID)
│ (streams content from librarian)
│ (chunks incrementally)
▼
[chunks → embedding → storage]
The chunker has one uniform interface:
- Receive a document ID (via Pulsar)
- Stream content from librarian
- Chunk it
It doesn't know or care whether the ID refers to:
- A user-uploaded text document
- An extracted text blob from a PDF page
- Any future document type
Child Document Metadata
Extend the document schema to track parent/child relationships:
-- Add columns to document table
ALTER TABLE document ADD parent_id text;
ALTER TABLE document ADD document_type text;
-- Index for finding children of a parent
CREATE INDEX document_parent ON document (parent_id);
Document types:
document_type |
Description |
|---|---|
source |
User-uploaded document (PDF, text, etc.) |
extracted |
Derived from a source document (e.g., PDF page text) |
Metadata fields:

| Field | Source Document | Extracted Child |
|---|---|---|
| id | user-provided or generated | generated (e.g., {parent-id}-page-{n}) |
| parent_id | NULL | parent document ID |
| document_type | source | extracted |
| kind | application/pdf, etc. | text/plain |
| title | user-provided | generated (e.g., "Page 3 of Report.pdf") |
| user | authenticated user | same as parent |
Librarian API for Child Documents
Creating child documents (internal, used by pdf-extractor):
{
"operation": "add-child-document",
"parent-id": "doc-123",
"document-metadata": {
"id": "doc-123-page-1",
"kind": "text/plain",
"title": "Page 1"
},
"content": "<base64-encoded-text>"
}
For small extracted text (typical page text is < 100KB), single-operation upload is acceptable. For very large text extractions, chunked upload could be used.
Listing child documents (for debugging/admin):
{
"operation": "list-children",
"parent-id": "doc-123"
}
Response:
{
"children": [
{ "id": "doc-123-page-1", "title": "Page 1", "kind": "text/plain" },
{ "id": "doc-123-page-2", "title": "Page 2", "kind": "text/plain" },
...
]
}
User-Facing Behavior
list-documents default behavior:
SELECT * FROM document WHERE user = ? AND parent_id IS NULL;
Only top-level (source) documents appear in the user's document list. Child documents are filtered out by default.
Optional include-children flag (for admin/debugging):
{
"operation": "list-documents",
"include-children": true
}
Cascade Delete
When a parent document is deleted, all children must be deleted:
def delete_document(doc_id: str, user: str):
    # Find all children
    children = query("SELECT id, object_id FROM document WHERE parent_id = ?", doc_id)

    # Delete child blobs from S3
    for child in children:
        blob_store.delete(child.object_id)

    # Delete child metadata from Cassandra
    execute("DELETE FROM document WHERE parent_id = ?", doc_id)

    # Delete parent blob and metadata
    parent = get_document(doc_id)
    blob_store.delete(parent.object_id)
    execute("DELETE FROM document WHERE id = ? AND user = ?", doc_id, user)
Storage Considerations
Extracted text blobs do duplicate content:
- Original PDF stored in Garage
- Extracted text per page also stored in Garage
This tradeoff enables:
- Uniform chunker interface: Chunker always fetches by ID
- Resume/retry: Can restart at chunker stage without re-extracting PDF
- Debugging: Extracted text is inspectable
- Separation of concerns: PDF extractor and chunker are independent services
For a 500MB PDF with 200 pages averaging 5KB text per page:
- PDF storage: 500MB
- Extracted text storage: ~1MB total
- Overhead: negligible
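The overhead estimate above, worked through:

```python
pages = 200
text_per_page = 5 * 1024             # 5KB of extracted text per page
pdf_size = 500 * 1024 * 1024         # 500MB original PDF

extracted_total = pages * text_per_page   # 1,024,000 bytes, about 1MB
overhead = extracted_total / pdf_size
print(f"{overhead:.2%}")  # 0.20%
```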
PDF Extractor Output
The pdf-extractor, after processing a document:
- Streams PDF from librarian to temp file
- Extracts text page by page
- For each page, stores extracted text as child document via librarian
- Sends child document IDs to chunker queue
import tempfile

from pypdf import PdfReader

async def extract_pdf(doc_id: str, librarian_client, output_queue):
    """Extract PDF pages and store as child documents."""
    with tempfile.NamedTemporaryFile(delete=True, suffix='.pdf') as tmp:
        # Stream PDF to temp file
        for chunk in librarian_client.stream_document(doc_id):
            tmp.write(chunk)
        tmp.flush()

        # Extract pages
        reader = PdfReader(tmp.name)
        for page_num, page in enumerate(reader.pages, start=1):
            text = page.extract_text()

            # Store as child document
            child_id = f"{doc_id}-page-{page_num}"
            await librarian_client.add_child_document(
                parent_id=doc_id,
                document_id=child_id,
                kind="text/plain",
                title=f"Page {page_num}",
                content=text.encode('utf-8')
            )

            # Send to chunker queue
            await output_queue.send(child_id)
The chunker receives these child IDs and processes them identically to how it would process a user-uploaded text document.
Client Updates
Python SDK
The Python SDK (trustgraph-base/trustgraph/api/library.py) should handle
chunked uploads transparently. The public interface remains unchanged:
# Existing interface - no change for users
library.add_document(
id="doc-123",
title="Large Report",
kind="application/pdf",
content=large_pdf_bytes, # Can be hundreds of MB
tags=["reports"]
)
Internally, the SDK detects document size and switches strategy:
class Library:
    CHUNKED_UPLOAD_THRESHOLD = 2 * 1024 * 1024  # 2MB

    def add_document(self, id, title, kind, content, tags=None, **kwargs):
        if len(content) < self.CHUNKED_UPLOAD_THRESHOLD:
            # Small document: single operation (existing behavior)
            return self._add_document_single(id, title, kind, content, tags)
        else:
            # Large document: chunked upload
            return self._add_document_chunked(id, title, kind, content, tags)

    def _add_document_chunked(self, id, title, kind, content, tags):
        # 1. begin-upload
        session = self._begin_upload(
            document_metadata={...},
            total_size=len(content),
            chunk_size=5 * 1024 * 1024
        )
        # 2. upload-chunk for each chunk
        for i, chunk in enumerate(self._chunk_bytes(content, session.chunk_size)):
            self._upload_chunk(session.upload_id, i, chunk)
        # 3. complete-upload
        return self._complete_upload(session.upload_id)

    def _chunk_bytes(self, content, chunk_size):
        # Slice the payload into fixed-size chunks (the last may be short)
        for i in range(0, len(content), chunk_size):
            yield content[i:i + chunk_size]
Progress callbacks (optional enhancement):
def add_document(self, ..., on_progress=None):
"""
on_progress: Optional callback(bytes_sent, total_bytes)
"""
This allows UIs to display upload progress without changing the basic API.
CLI Tools
tg-add-library-document continues to work unchanged:
# Works transparently for any size - SDK handles chunking internally
tg-add-library-document --file large-report.pdf --title "Large Report"
Optional progress display could be added:
tg-add-library-document --file large-report.pdf --title "Large Report" --progress
# Output:
# Uploading: 45% (225MB / 500MB)
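The progress line could be produced from the SDK's on_progress callback; a hypothetical formatter matching the output above (not an existing CLI function):

```python
def format_progress(bytes_sent: int, total_bytes: int) -> str:
    """Render progress as the CLI example shows it (integer percent, MB)."""
    pct = 100 * bytes_sent // total_bytes
    mb = 1024 * 1024
    return f"Uploading: {pct}% ({bytes_sent // mb}MB / {total_bytes // mb}MB)"

print(format_progress(225 * 1024 * 1024, 500 * 1024 * 1024))
# Uploading: 45% (225MB / 500MB)
```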
Legacy tools removed:
- tg-load-pdf - deprecated, use tg-add-library-document
- tg-load-text - deprecated, use tg-add-library-document
Admin/debug commands (optional, low priority):
# List incomplete uploads (admin troubleshooting)
tg-add-library-document --list-pending
# Resume specific upload (recovery scenario)
tg-add-library-document --resume upload-abc-123 --file large-report.pdf
These could be flags on the existing command rather than separate tools.
API Specification Updates
The OpenAPI spec (specs/api/paths/librarian.yaml) needs updates for:
New operations:
- begin-upload - Initialize chunked upload session
- upload-chunk - Upload individual chunk
- complete-upload - Finalize upload
- abort-upload - Cancel upload
- get-upload-status - Query upload progress
- list-uploads - List incomplete uploads for user
- stream-document - Streaming document retrieval
- add-child-document - Store extracted text (internal)
- list-children - List child documents (admin)
Modified operations:
- list-documents - Add include-children parameter
New schemas:
- ChunkedUploadBeginRequest
- ChunkedUploadBeginResponse
- ChunkedUploadChunkRequest
- ChunkedUploadChunkResponse
- UploadSession
- UploadProgress
WebSocket spec updates (specs/websocket/):
Mirror the REST operations for WebSocket clients, enabling real-time progress updates during upload.
UX Considerations
The API spec updates enable frontend improvements:
Upload progress UI:
- Progress bar showing chunks uploaded
- Estimated time remaining
- Pause/resume capability
Error recovery:
- "Resume upload" option for interrupted uploads
- List of pending uploads on reconnect
Large file handling:
- Client-side file size detection
- Automatic chunked upload for large files
- Clear feedback during long uploads
These UX improvements require frontend work guided by the updated API spec.