Tech spec
BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up
Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
update_upload_session_chunk(), delete_upload_session(),
list_upload_sessions()
- Schema extended with UploadSession, UploadProgress, and new
request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
- add_document() auto-switches to chunked for files > 10MB
- Progress callback support (on_progress)
- get_pending_uploads(), get_upload_status(), abort_upload(),
resume_upload()
- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
streaming retrieval
- Librarian operations:
- add-child-document for extracted PDF pages
- list-children to get child documents
- stream-document for chunked content retrieval
- Cascade delete removes children when parent is deleted
- list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
warnings directing users to tg-add-library-document +
tg-start-library-processing
Remove load_pdf and load_text utils
Move chunker/librarian comms to base class
Update tests
Large Document Loading Technical Specification
Overview
This specification addresses scalability and user experience issues when loading large documents into TrustGraph. The current architecture treats document upload as a single atomic operation, causing memory pressure at multiple points in the pipeline and providing no feedback or recovery options to users.
This implementation targets the following use cases:
- Large PDF Processing: Upload and process multi-hundred-megabyte PDF files without exhausting memory
- Resumable Uploads: Allow interrupted uploads to continue from where they left off rather than restarting
- Progress Feedback: Provide users with real-time visibility into upload and processing progress
- Memory-Efficient Processing: Process documents in a streaming fashion without holding entire files in memory
Goals
- Incremental Upload: Support chunked document upload via REST and WebSocket
- Resumable Transfers: Enable recovery from interrupted uploads
- Progress Visibility: Provide upload/processing progress feedback to clients
- Memory Efficiency: Eliminate full-document buffering throughout the pipeline
- Backward Compatibility: Existing small-document workflows continue unchanged
- Streaming Processing: PDF decoding and text chunking operate on streams
Background
Current Architecture
Document submission flows through the following path:
- Client submits document via REST (POST /api/v1/librarian) or WebSocket
- API Gateway receives complete request with base64-encoded document content
- LibrarianRequestor translates request to Pulsar message
- Librarian Service receives message, decodes document into memory
- BlobStore uploads document to Garage/S3
- Cassandra stores metadata with object reference
- For processing: document retrieved from S3, decoded, and chunked, all in memory
Key files:
- REST/WebSocket entry: trustgraph-flow/trustgraph/gateway/service.py
- Librarian core: trustgraph-flow/trustgraph/librarian/librarian.py
- Blob storage: trustgraph-flow/trustgraph/librarian/blob_store.py
- Cassandra tables: trustgraph-flow/trustgraph/tables/library.py
- API schema: trustgraph-base/trustgraph/schema/services/library.py
Current Limitations
The current design has several compounding memory and UX issues:
- Atomic Upload Operation: The entire document must be transmitted in a single request. Large documents require long-running requests with no progress indication and no retry mechanism if the connection fails.
- API Design: Both REST and WebSocket APIs expect the complete document in a single message. The schema (LibrarianRequest) has a single content field containing the entire base64-encoded document.
- Librarian Memory: The librarian service decodes the entire document into memory before uploading to S3. For a 500MB PDF, this means holding 500MB+ in process memory.
- PDF Decoder Memory: When processing begins, the PDF decoder loads the entire PDF into memory to extract text. PyPDF and similar libraries typically require full document access.
- Chunker Memory: The text chunker receives the complete extracted text and holds it in memory while producing chunks.
Memory Impact Example (500MB PDF):
- Gateway: ~700MB (base64 encoding overhead)
- Librarian: ~500MB (decoded bytes)
- PDF Decoder: ~500MB + extraction buffers
- Chunker: extracted text (variable, potentially 100MB+)
Total peak memory can exceed 2GB for a single large document.
Technical Design
Design Principles
- API Facade: All client interaction goes through the librarian API. Clients have no direct access to or knowledge of the underlying S3/Garage storage.
- S3 Multipart Upload: Use standard S3 multipart upload under the hood. This is widely supported across S3-compatible systems (AWS S3, MinIO, Garage, Ceph, DigitalOcean Spaces, Backblaze B2, etc.), ensuring portability.
- Atomic Completion: S3 multipart uploads are inherently atomic: uploaded parts are invisible until CompleteMultipartUpload is called. No temporary files or rename operations are needed.
- Trackable State: Upload sessions are tracked in Cassandra, providing visibility into incomplete uploads and enabling resume capability.
Chunked Upload Flow
Client Librarian API S3/Garage
│ │ │
│── begin-upload ───────────►│ │
│ (metadata, size) │── CreateMultipartUpload ────►│
│ │◄── s3_upload_id ─────────────│
│◄── upload_id ──────────────│ (store session in │
│ │ Cassandra) │
│ │ │
│── upload-chunk ───────────►│ │
│ (upload_id, index, data) │── UploadPart ───────────────►│
│ │◄── etag ─────────────────────│
│◄── ack + progress ─────────│ (store etag in session) │
│ ⋮ │ ⋮ │
│ (repeat for all chunks) │ │
│ │ │
│── complete-upload ────────►│ │
│ (upload_id) │── CompleteMultipartUpload ──►│
│ │ (parts coalesced by S3) │
│ │── store doc metadata ───────►│ Cassandra
│◄── document_id ────────────│ (delete session) │
The client never interacts with S3 directly. The librarian translates between our chunked upload API and S3 multipart operations internally.
Librarian API Operations
begin-upload
Initialize a chunked upload session.
Request:
{
"operation": "begin-upload",
"document-metadata": {
"id": "doc-123",
"kind": "application/pdf",
"title": "Large Document",
"user": "user-id",
"tags": ["tag1", "tag2"]
},
"total-size": 524288000,
"chunk-size": 5242880
}
Response:
{
"upload-id": "upload-abc-123",
"chunk-size": 5242880,
"total-chunks": 100
}
The librarian:
- Generates a unique upload_id and object_id (a UUID for blob storage)
- Calls S3 CreateMultipartUpload, receives s3_upload_id
- Creates a session record in Cassandra
- Returns upload_id to the client
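The steps above can be sketched as follows. This is a hypothetical handler, not the actual TrustGraph code: the injected create_mpu and save_session callables stand in for the BlobStore and the Cassandra session table.

```python
import math
import uuid

def begin_upload(metadata, total_size, chunk_size, create_mpu, save_session):
    """Sketch of the begin-upload handler (dependencies injected for clarity)."""
    upload_id = f"upload-{uuid.uuid4()}"
    object_id = uuid.uuid4()                           # target blob ID
    total_chunks = math.ceil(total_size / chunk_size)
    # Ask the blob store to start an S3 multipart upload
    s3_upload_id = create_mpu(object_id, metadata["kind"])
    # Persist the session so chunks can be tracked and resumed
    save_session(upload_id, metadata, object_id, s3_upload_id,
                 total_size, chunk_size, total_chunks)
    return {"upload-id": upload_id,
            "chunk-size": chunk_size,
            "total-chunks": total_chunks}
```

Note that the response fields match the begin-upload response shown above.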
upload-chunk
Upload a single chunk.
Request:
{
"operation": "upload-chunk",
"upload-id": "upload-abc-123",
"chunk-index": 0,
"content": "<base64-encoded-chunk>"
}
Response:
{
"upload-id": "upload-abc-123",
"chunk-index": 0,
"chunks-received": 1,
"total-chunks": 100,
"bytes-received": 5242880,
"total-bytes": 524288000
}
The librarian:
- Looks up session by upload_id
- Validates ownership (user must match session creator)
- Calls S3 UploadPart with chunk data, receives etag
- Updates session record with chunk index and etag
- Returns progress to client
Failed chunks can be retried - just send the same chunk-index again.
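A minimal client-side retry sketch; the send callable is a stand-in (an assumption, not the SDK API) for whatever performs the upload-chunk request:

```python
import time

def upload_chunk_with_retry(send, upload_id, chunk_index, data, attempts=3):
    """Retry a failed chunk with the same chunk-index; this is safe because
    re-sending a part number simply overwrites the earlier S3 part."""
    for attempt in range(attempts):
        try:
            return send(upload_id, chunk_index, data)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
```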
complete-upload
Finalize the upload and create the document.
Request:
{
"operation": "complete-upload",
"upload-id": "upload-abc-123"
}
Response:
{
"document-id": "doc-123",
"object-id": "550e8400-e29b-41d4-a716-446655440000"
}
The librarian:
- Looks up session, verifies all chunks received
- Calls S3 CompleteMultipartUpload with the part etags (S3 coalesces parts internally, at zero memory cost to the librarian)
- Creates document record in Cassandra with metadata and object reference
- Deletes upload session record
- Returns document ID to client
abort-upload
Cancel an in-progress upload.
Request:
{
"operation": "abort-upload",
"upload-id": "upload-abc-123"
}
The librarian:
- Calls S3 AbortMultipartUpload to clean up parts
- Deletes session record from Cassandra
get-upload-status
Query status of an upload (for resume capability).
Request:
{
"operation": "get-upload-status",
"upload-id": "upload-abc-123"
}
Response:
{
"upload-id": "upload-abc-123",
"state": "in-progress",
"chunks-received": [0, 1, 2, 5, 6],
"missing-chunks": [3, 4, 7, 8],
"total-chunks": 100,
"bytes-received": 36700160,
"total-bytes": 524288000
}
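Deriving the missing-chunks list from the session's received set is straightforward (a small total is used here for illustration):

```python
def missing_chunks(received: list[int], total_chunks: int) -> list[int]:
    """Chunks the client still needs to upload before complete-upload."""
    have = set(received)
    return [i for i in range(total_chunks) if i not in have]

print(missing_chunks([0, 1, 2, 5, 6], 9))  # [3, 4, 7, 8]
```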
list-uploads
List incomplete uploads for a user.
Request:
{
"operation": "list-uploads"
}
Response:
{
"uploads": [
{
"upload-id": "upload-abc-123",
"document-metadata": { "title": "Large Document", ... },
"progress": { "chunks-received": 43, "total-chunks": 100 },
"created-at": "2024-01-15T10:30:00Z"
}
]
}
Upload Session Storage
Track in-progress uploads in Cassandra:
CREATE TABLE upload_session (
upload_id text PRIMARY KEY,
user text,
document_id text,
document_metadata text, -- JSON: title, kind, tags, comments, etc.
s3_upload_id text, -- internal, for S3 operations
object_id uuid, -- target blob ID
total_size bigint,
chunk_size int,
total_chunks int,
chunks_received map<int, text>, -- chunk_index → etag
created_at timestamp,
updated_at timestamp
) WITH default_time_to_live = 86400; -- 24 hour TTL
CREATE INDEX upload_session_user ON upload_session (user);
TTL Behavior:
- Sessions expire after 24 hours if not completed
- When Cassandra TTL expires, the session record is deleted
- Orphaned S3 parts are cleaned up by S3 lifecycle policy (configure on bucket)
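One way to implement the bucket-side cleanup is an S3 lifecycle rule that aborts incomplete multipart uploads after a day, matching the session TTL (the rule ID below is arbitrary):

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
```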
Failure Handling and Atomicity
Chunk upload failure:
- Client retries the failed chunk (same upload_id and chunk-index)
- S3 UploadPart is idempotent for the same part number
- Session tracks which chunks succeeded
Client disconnect mid-upload:
- Session remains in Cassandra with received chunks recorded
- Client can call get-upload-status to see what's missing
- Resume by uploading only missing chunks, then complete-upload
Complete-upload failure:
- S3 CompleteMultipartUpload is atomic: either it succeeds fully or it fails
- On failure, parts remain and the client can retry complete-upload
- No partial document is ever visible
Session expiry:
- Cassandra TTL deletes session record after 24 hours
- S3 bucket lifecycle policy cleans up incomplete multipart uploads
- No manual cleanup required
S3 Multipart Atomicity
S3 multipart uploads provide built-in atomicity:
- Parts are invisible: Uploaded parts cannot be accessed as objects. They exist only as parts of an incomplete multipart upload.
- Atomic completion: CompleteMultipartUpload either succeeds (the object appears atomically) or fails (no object is created). There is no partial state.
- No rename needed: The final object key is specified at CreateMultipartUpload time. Parts are coalesced directly to that key.
- Server-side coalesce: S3 combines parts internally. The librarian never reads parts back, so memory overhead is zero regardless of document size.
BlobStore Extensions
File: trustgraph-flow/trustgraph/librarian/blob_store.py
Add multipart upload methods:
from typing import List, Tuple
from uuid import UUID

class BlobStore:

    # Existing methods...

    def create_multipart_upload(self, object_id: UUID, kind: str) -> str:
        """Initialize multipart upload, return s3_upload_id."""
        # minio client: create_multipart_upload()
        ...

    def upload_part(
        self, object_id: UUID, s3_upload_id: str,
        part_number: int, data: bytes
    ) -> str:
        """Upload a single part, return etag."""
        # minio client: upload_part()
        # Note: S3 part numbers are 1-indexed
        ...

    def complete_multipart_upload(
        self, object_id: UUID, s3_upload_id: str,
        parts: List[Tuple[int, str]]  # [(part_number, etag), ...]
    ) -> None:
        """Finalize multipart upload."""
        # minio client: complete_multipart_upload()
        ...

    def abort_multipart_upload(
        self, object_id: UUID, s3_upload_id: str
    ) -> None:
        """Cancel multipart upload, clean up parts."""
        # minio client: abort_multipart_upload()
        ...
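Since the session stores a 0-indexed chunk_index → etag map while S3 part numbers are 1-indexed, completion needs a small translation step. A helper along these lines (the name is illustrative, not an existing function):

```python
def to_s3_parts(chunks_received: dict[int, str]) -> list[tuple[int, str]]:
    """Convert the session's chunk_index -> etag map into the ordered,
    1-indexed part list that complete_multipart_upload expects."""
    return [(index + 1, etag) for index, etag in sorted(chunks_received.items())]

print(to_s3_parts({1: "etag-b", 0: "etag-a"}))  # [(1, 'etag-a'), (2, 'etag-b')]
```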
Chunk Size Considerations
- S3 minimum: 5MB per part (except last part)
- S3 maximum: 10,000 parts per upload
- Practical default: 5MB chunks
- 500MB document = 100 chunks
- 5GB document = 1,000 chunks
- Progress granularity: Smaller chunks = finer progress updates
- Network efficiency: Larger chunks = fewer round trips
Chunk size could be client-configurable within bounds (5MB - 100MB).
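These constraints can be validated up front when a client proposes a chunk size. A sketch, using the S3 multipart limits stated above:

```python
import math

S3_MIN_PART_SIZE = 5 * 1024 * 1024   # 5MB minimum per part (except last)
S3_MAX_PARTS = 10_000                # maximum parts per upload

def plan_upload(total_size: int, chunk_size: int = S3_MIN_PART_SIZE) -> int:
    """Validate the chunk size against S3 limits and return the chunk count."""
    if chunk_size < S3_MIN_PART_SIZE:
        raise ValueError("chunk size below the 5MB S3 part minimum")
    total_chunks = math.ceil(total_size / chunk_size)
    if total_chunks > S3_MAX_PARTS:
        raise ValueError("too many parts; increase the chunk size")
    return total_chunks

print(plan_upload(524288000))  # 100 chunks for a 500MB document
```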
Document Processing: Streaming Retrieval
The upload flow addresses getting documents into storage efficiently. The processing flow addresses extracting and chunking documents without loading them entirely into memory.
Design Principle: Identifier, Not Content
Currently, when processing is triggered, document content flows through Pulsar messages. This loads entire documents into memory. Instead:
- Pulsar messages carry only the document identifier
- Processors fetch document content directly from librarian
- Fetching happens as a stream to temporary file
- Document-specific parsing (PDF, text, etc.) works with files, not memory buffers
This keeps the librarian document-structure-agnostic. PDF parsing, text extraction, and other format-specific logic stays in the respective decoders.
Processing Flow
Pulsar PDF Decoder Librarian S3
│ │ │ │
│── doc-id ───────────►│ │ │
│ (processing msg) │ │ │
│ │ │ │
│ │── stream-document ──────►│ │
│ │ (doc-id) │── GetObject ────►│
│ │ │ │
│ │◄── chunk ────────────────│◄── stream ───────│
│ │ (write to temp file) │ │
│ │◄── chunk ────────────────│◄── stream ───────│
│ │ (append to temp file) │ │
│ │ ⋮ │ ⋮ │
│ │◄── EOF ──────────────────│ │
│ │ │ │
│ │ ┌──────────────────────────┐ │
│ │ │ temp file on disk │ │
│ │ │ (memory stays bounded) │ │
│ │ └────────────┬─────────────┘ │
│ │ │ │
│ │ PDF library opens file │
│ │ extract page 1 text ──► chunker │
│ │ extract page 2 text ──► chunker │
│ │ ⋮ │
│ │ close file │
│ │ delete temp file │
Librarian Stream API
Add a streaming document retrieval operation:
stream-document
Request:
{
"operation": "stream-document",
"document-id": "doc-123"
}
Response: Streamed binary chunks (not a single response).
For REST API, this returns a streaming response with Transfer-Encoding: chunked.
For internal service-to-service calls (processor to librarian), this could be:
- Direct S3 streaming via presigned URL (if internal network allows)
- Chunked responses over the service protocol
- A dedicated streaming endpoint
The key requirement: data flows in chunks, never fully buffered in librarian.
PDF Decoder Changes
Current implementation (memory-intensive):
from io import BytesIO
from pypdf import PdfReader

def decode_pdf(document_content: bytes) -> str:
    reader = PdfReader(BytesIO(document_content))  # full doc in memory
    text = ""
    for page in reader.pages:
        text += page.extract_text()  # accumulating
    return text  # full text in memory
New implementation (temp file, incremental):
import tempfile
from typing import Iterator

from pypdf import PdfReader

def decode_pdf_streaming(doc_id: str, librarian_client) -> Iterator[str]:
    """Yield extracted text page by page."""
    with tempfile.NamedTemporaryFile(delete=True, suffix='.pdf') as tmp:
        # Stream document to temp file
        for chunk in librarian_client.stream_document(doc_id):
            tmp.write(chunk)
        tmp.flush()
        # Open PDF from file (not memory)
        reader = PdfReader(tmp.name)
        # Yield pages incrementally
        for page in reader.pages:
            yield page.extract_text()
        # tmp file auto-deleted on context exit
Memory profile:
- Temp file on disk: size of PDF (disk is cheap)
- In memory: one page's text at a time
- Peak memory: bounded, independent of document size
Text Document Decoder Changes
For plain text documents, the flow is even simpler - no temp file is needed:
import codecs
from typing import Iterator

def decode_text_streaming(doc_id: str, librarian_client) -> Iterator[str]:
    """Yield text in chunks as it streams from storage."""
    # Incremental decoder: a multi-byte UTF-8 character split across a
    # chunk boundary would break a plain chunk.decode('utf-8')
    decoder = codecs.getincrementaldecoder('utf-8')()
    buffer = ""
    for chunk in librarian_client.stream_document(doc_id):
        buffer += decoder.decode(chunk)
        # Yield complete paragraphs as they arrive
        while '\n\n' in buffer:
            paragraph, buffer = buffer.split('\n\n', 1)
            yield paragraph + '\n\n'
    # Flush the decoder and yield the remaining buffer
    buffer += decoder.decode(b'', final=True)
    if buffer:
        yield buffer
Text documents can stream directly without temp file since they're linearly structured.
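One subtlety when decoding a byte stream: a chunk boundary can split a multi-byte UTF-8 character, so naive per-chunk decode calls can fail. The standard library's incremental decoder handles this; a standalone illustration:

```python
import codecs

# 'é' is two bytes in UTF-8; split the stream inside it
data = "café\n\nmenu".encode("utf-8")
chunks = [data[:4], data[4:]]

# chunks[0].decode('utf-8') would raise UnicodeDecodeError here;
# the incremental decoder buffers the partial character instead
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(c) for c in chunks)
text += decoder.decode(b"", final=True)
print(text == "café\n\nmenu")  # True
```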
Streaming Chunker Integration
The chunker receives an iterator of text (pages or paragraphs) and produces chunks incrementally:
from typing import Iterator

class StreamingChunker:
    def __init__(self, chunk_size: int, overlap: int):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def process(self, text_stream: Iterator[str]) -> Iterator[str]:
        """Yield chunks as text arrives."""
        buffer = ""
        for text_segment in text_stream:
            buffer += text_segment
            while len(buffer) >= self.chunk_size:
                chunk = buffer[:self.chunk_size]
                yield chunk
                # Keep overlap for context continuity
                buffer = buffer[self.chunk_size - self.overlap:]
        # Yield remaining buffer as final chunk
        if buffer.strip():
            yield buffer
End-to-End Processing Pipeline
async def process_document(doc_id: str, librarian_client, embedder):
    """Process document with bounded memory."""
    # Get document metadata to determine type
    metadata = await librarian_client.get_document_metadata(doc_id)

    # Select decoder based on document type
    if metadata.kind == 'application/pdf':
        text_stream = decode_pdf_streaming(doc_id, librarian_client)
    elif metadata.kind == 'text/plain':
        text_stream = decode_text_streaming(doc_id, librarian_client)
    else:
        raise UnsupportedDocumentType(metadata.kind)

    # Chunk incrementally
    chunker = StreamingChunker(chunk_size=1000, overlap=100)

    # Process each chunk as it's produced
    for chunk in chunker.process(text_stream):
        # Generate embeddings, store in vector DB, etc.
        embedding = await embedder.embed(chunk)
        await store_chunk(doc_id, chunk, embedding)
At no point is the full document or full extracted text held in memory.
Temp File Considerations
Location: Use system temp directory (/tmp or equivalent). For
containerized deployments, ensure temp directory has sufficient space
and is on fast storage (not network-mounted if possible).
Cleanup: Use context managers (with tempfile...) to ensure cleanup
even on exceptions.
Concurrent processing: Each processing job gets its own temp file. No conflicts between parallel document processing.
Disk space: Temp files are short-lived (duration of processing). For a 500MB PDF, need 500MB temp space during processing. Size limit could be enforced at upload time if disk space is constrained.
Unified Processing Interface: Child Documents
PDF extraction and text document processing need to feed into the same downstream pipeline (chunker → embeddings → storage). To achieve this with a consistent "fetch by ID" interface, extracted text blobs are stored back to librarian as child documents.
Processing Flow with Child Documents
PDF Document Text Document
│ │
▼ │
pdf-extractor │
│ │
│ (stream PDF from librarian) │
│ (extract page 1 text) │
│ (store as child doc → librarian) │
│ (extract page 2 text) │
│ (store as child doc → librarian) │
│ ⋮ │
▼ ▼
[child-doc-id, child-doc-id, ...] [doc-id]
│ │
└─────────────────────┬───────────────────────────────┘
▼
chunker
│
│ (receives document ID)
│ (streams content from librarian)
│ (chunks incrementally)
▼
[chunks → embedding → storage]
The chunker has one uniform interface:
- Receive a document ID (via Pulsar)
- Stream content from librarian
- Chunk it
It doesn't know or care whether the ID refers to:
- A user-uploaded text document
- An extracted text blob from a PDF page
- Any future document type
Child Document Metadata
Extend the document schema to track parent/child relationships:
-- Add columns to document table
ALTER TABLE document ADD parent_id text;
ALTER TABLE document ADD document_type text;
-- Index for finding children of a parent
CREATE INDEX document_parent ON document (parent_id);
Document types:
document_type |
Description |
|---|---|
source |
User-uploaded document (PDF, text, etc.) |
extracted |
Derived from a source document (e.g., PDF page text) |
Metadata fields:

| Field | Source Document | Extracted Child |
|---|---|---|
| id | user-provided or generated | generated (e.g., {parent-id}-page-{n}) |
| parent_id | NULL | parent document ID |
| document_type | source | extracted |
| kind | application/pdf, etc. | text/plain |
| title | user-provided | generated (e.g., "Page 3 of Report.pdf") |
| user | authenticated user | same as parent |
Librarian API for Child Documents
Creating child documents (internal, used by pdf-extractor):
{
"operation": "add-child-document",
"parent-id": "doc-123",
"document-metadata": {
"id": "doc-123-page-1",
"kind": "text/plain",
"title": "Page 1"
},
"content": "<base64-encoded-text>"
}
For small extracted text (typical page text is < 100KB), single-operation upload is acceptable. For very large text extractions, chunked upload could be used.
Listing child documents (for debugging/admin):
{
"operation": "list-children",
"parent-id": "doc-123"
}
Response:
{
"children": [
{ "id": "doc-123-page-1", "title": "Page 1", "kind": "text/plain" },
{ "id": "doc-123-page-2", "title": "Page 2", "kind": "text/plain" },
...
]
}
User-Facing Behavior
list-documents default behavior:
SELECT * FROM document WHERE user = ? AND parent_id IS NULL;
Only top-level (source) documents appear in the user's document list. Child documents are filtered out by default.
Optional include-children flag (for admin/debugging):
{
"operation": "list-documents",
"include-children": true
}
Cascade Delete
When a parent document is deleted, all children must be deleted:
def delete_document(doc_id: str, user: str):
    # Find all children
    children = query("SELECT id, object_id FROM document WHERE parent_id = ?", doc_id)

    # Delete child blobs from S3
    for child in children:
        blob_store.delete(child.object_id)

    # Delete child metadata from Cassandra
    execute("DELETE FROM document WHERE parent_id = ?", doc_id)

    # Delete parent blob and metadata
    parent = get_document(doc_id)
    blob_store.delete(parent.object_id)
    execute("DELETE FROM document WHERE id = ? AND user = ?", doc_id, user)
Storage Considerations
Extracted text blobs do duplicate content:
- Original PDF stored in Garage
- Extracted text per page also stored in Garage
This tradeoff enables:
- Uniform chunker interface: Chunker always fetches by ID
- Resume/retry: Can restart at chunker stage without re-extracting PDF
- Debugging: Extracted text is inspectable
- Separation of concerns: PDF extractor and chunker are independent services
For a 500MB PDF with 200 pages averaging 5KB text per page:
- PDF storage: 500MB
- Extracted text storage: ~1MB total
- Overhead: negligible
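The overhead estimate above, worked through:

```python
pages = 200
text_per_page = 5 * 1024             # 5KB of extracted text per page
pdf_size = 500 * 1024 * 1024         # 500MB original PDF

extracted_total = pages * text_per_page   # 1,024,000 bytes, about 1MB
overhead = extracted_total / pdf_size
print(f"{overhead:.2%}")  # 0.20%
```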
PDF Extractor Output
The pdf-extractor, after processing a document:
- Streams PDF from librarian to temp file
- Extracts text page by page
- For each page, stores extracted text as child document via librarian
- Sends child document IDs to chunker queue
import tempfile

from pypdf import PdfReader

async def extract_pdf(doc_id: str, librarian_client, output_queue):
    """Extract PDF pages and store as child documents."""
    with tempfile.NamedTemporaryFile(delete=True, suffix='.pdf') as tmp:
        # Stream PDF to temp file
        for chunk in librarian_client.stream_document(doc_id):
            tmp.write(chunk)
        tmp.flush()

        # Extract pages
        reader = PdfReader(tmp.name)
        for page_num, page in enumerate(reader.pages, start=1):
            text = page.extract_text()

            # Store as child document
            child_id = f"{doc_id}-page-{page_num}"
            await librarian_client.add_child_document(
                parent_id=doc_id,
                document_id=child_id,
                kind="text/plain",
                title=f"Page {page_num}",
                content=text.encode('utf-8')
            )

            # Send to chunker queue
            await output_queue.send(child_id)
The chunker receives these child IDs and processes them identically to how it would process a user-uploaded text document.
Client Updates
Python SDK
The Python SDK (trustgraph-base/trustgraph/api/library.py) should handle
chunked uploads transparently. The public interface remains unchanged:
# Existing interface - no change for users
library.add_document(
id="doc-123",
title="Large Report",
kind="application/pdf",
content=large_pdf_bytes, # Can be hundreds of MB
tags=["reports"]
)
Internally, the SDK detects document size and switches strategy:
class Library:
    CHUNKED_UPLOAD_THRESHOLD = 2 * 1024 * 1024  # 2MB

    def add_document(self, id, title, kind, content, tags=None, **kwargs):
        if len(content) < self.CHUNKED_UPLOAD_THRESHOLD:
            # Small document: single operation (existing behavior)
            return self._add_document_single(id, title, kind, content, tags)
        else:
            # Large document: chunked upload
            return self._add_document_chunked(id, title, kind, content, tags)

    def _add_document_chunked(self, id, title, kind, content, tags):
        # 1. begin-upload
        session = self._begin_upload(
            document_metadata={...},
            total_size=len(content),
            chunk_size=5 * 1024 * 1024
        )
        # 2. upload-chunk for each chunk
        for i, chunk in enumerate(self._chunk_bytes(content, session.chunk_size)):
            self._upload_chunk(session.upload_id, i, chunk)
        # 3. complete-upload
        return self._complete_upload(session.upload_id)

    def _chunk_bytes(self, content, chunk_size):
        # Slice the payload into fixed-size chunks (the last may be short)
        for i in range(0, len(content), chunk_size):
            yield content[i:i + chunk_size]
Progress callbacks (optional enhancement):
def add_document(self, ..., on_progress=None):
"""
on_progress: Optional callback(bytes_sent, total_bytes)
"""
This allows UIs to display upload progress without changing the basic API.
CLI Tools
tg-add-library-document continues to work unchanged:
# Works transparently for any size - SDK handles chunking internally
tg-add-library-document --file large-report.pdf --title "Large Report"
Optional progress display could be added:
tg-add-library-document --file large-report.pdf --title "Large Report" --progress
# Output:
# Uploading: 45% (225MB / 500MB)
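The progress line could be produced from the SDK's on_progress callback; a hypothetical formatter matching the output above (not an existing CLI function):

```python
def format_progress(bytes_sent: int, total_bytes: int) -> str:
    """Render progress as the CLI example shows it (integer percent, MB)."""
    pct = 100 * bytes_sent // total_bytes
    mb = 1024 * 1024
    return f"Uploading: {pct}% ({bytes_sent // mb}MB / {total_bytes // mb}MB)"

print(format_progress(225 * 1024 * 1024, 500 * 1024 * 1024))
# Uploading: 45% (225MB / 500MB)
```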
Legacy tools removed:
- tg-load-pdf - deprecated, use tg-add-library-document
- tg-load-text - deprecated, use tg-add-library-document
Admin/debug commands (optional, low priority):
# List incomplete uploads (admin troubleshooting)
tg-add-library-document --list-pending
# Resume specific upload (recovery scenario)
tg-add-library-document --resume upload-abc-123 --file large-report.pdf
These could be flags on the existing command rather than separate tools.
API Specification Updates
The OpenAPI spec (specs/api/paths/librarian.yaml) needs updates for:
New operations:
- begin-upload - Initialize chunked upload session
- upload-chunk - Upload individual chunk
- complete-upload - Finalize upload
- abort-upload - Cancel upload
- get-upload-status - Query upload progress
- list-uploads - List incomplete uploads for user
- stream-document - Streaming document retrieval
- add-child-document - Store extracted text (internal)
- list-children - List child documents (admin)
Modified operations:
- list-documents - Add include-children parameter
New schemas:
- ChunkedUploadBeginRequest
- ChunkedUploadBeginResponse
- ChunkedUploadChunkRequest
- ChunkedUploadChunkResponse
- UploadSession
- UploadProgress
WebSocket spec updates (specs/websocket/):
Mirror the REST operations for WebSocket clients, enabling real-time progress updates during upload.
UX Considerations
The API spec updates enable frontend improvements:
Upload progress UI:
- Progress bar showing chunks uploaded
- Estimated time remaining
- Pause/resume capability
Error recovery:
- "Resume upload" option for interrupted uploads
- List of pending uploads on reconnect
Large file handling:
- Client-side file size detection
- Automatic chunked upload for large files
- Clear feedback during long uploads
These UX improvements require frontend work guided by the updated API spec.